Abstract

We show how universal codes can be used for solving some of the 
most important statistical problems for time series. By definition, a 
universal code (or a universal lossless data compressor) can compress 
any sequence generated by a stationary and ergodic source asymptot- 
ically to the Shannon entropy, which, in turn, is the best achievable 
ratio for lossless data compressors. 

We consider finite-alphabet and real-valued time series and the
following problems: estimation of the limiting probabilities for
finite-alphabet time series and estimation of the density for
real-valued time series; on-line prediction, regression and
classification (or problems with side information) for both types of
time series; and two problems of hypothesis testing, namely
goodness-of-fit testing (or identity testing) and testing of serial
independence. It is important to note that all problems are considered
in the framework of classical mathematical statistics and that, at the
same time, everyday methods of data compression (or archivers) can be
used as a tool for estimation and testing.

It turns out that the suggested methods and tests are quite often more
powerful than known ones when they are applied in practice.

1 Introduction 

Since C. Shannon published the paper "A mathematical theory of
communication" [47], the ideas and results of Information Theory have
played an important role in cryptography [26, 48], mathematical
statistics [3} [Sj [25] and many other fields [6, 7] which are far from
telecommunications. Universal coding, which is a part of Information
Theory, has also been applied efficiently in many fields since its
discovery |2H I13j. In particular, the application of results of
universal coding, initiated in 1988 [35], created a new approach to
prediction [U [TUJ EH ES]. Perhaps the most unexpected application of
data compression ideas arises in experiments which show that some ant
species are capable of compressing messages and of adding and
subtracting small numbers \30[ B3].

In this chapter we describe a new approach to estimation, prediction
and hypothesis testing for time series, which was suggested recently
[35, 38, 42]. This approach is based on ideas of universal coding (or
universal data compression). We would like to emphasize that everyday
methods of data compression (or archivers) can be used directly as a
tool for estimation and hypothesis testing. It is important to note
that modern archivers (like zip, arj, rar, etc.) are based on deep
theoretical results of source coding theory [10\ [20l [Ml \32\ B6] and
have shown high efficiency in practice, because archivers can find many
kinds of latent regularities and use them for compression.

It is worth noting that this approach has been applied to the problem
of randomness testing [32]. This problem is quite important in
practice; in particular, the National Institute of Standards and
Technology of the USA (NIST) has suggested "A statistical test suite
for random and pseudorandom number generators for cryptographic
applications" [33], which consists of 16 tests. It has turned out that
tests based on universal codes are more powerful than the tests
suggested by NIST [32].

The outline of this paper is as follows. The next section contains some
necessary definitions and facts about predictors, codes and hypothesis
testing, and a description of one universal code. Sections 3 and 4 are
devoted to problems of estimation and hypothesis testing, respectively,
for the case of finite-alphabet time series. The case of infinite
alphabets is considered in Section 5. All proofs are given in the
Appendix, but some intuitive indications are given in the body of the
paper.

2 Definitions and Statements of the Problems 

2.1 Estimation and Prediction for I.I.D. Sources 

First we consider a source with unknown statistics which generates
sequences $x_1 x_2 \ldots$ of letters from some set (or alphabet) $A$.
It will be convenient now to describe briefly the prediction problem.
Let the source generate a message $x_1 \ldots x_{t-1} x_t$, $x_i \in A$
for all $i$, and let the following letter $x_{t+1}$ need to be
predicted. This problem can be traced back to Laplace [HI [29], who
considered the problem of estimating the probability that the sun will
rise tomorrow, given that it has risen every day since Creation. In our
notation the alphabet $A$ contains two letters, 0 ("the sun rises") and
1 ("the sun does not rise"), $t$ is the number of days since Creation,
and $x_1 \ldots x_{t-1} x_t = 00 \ldots 0$.
Laplace suggested the following predictor:

\[ L_0(a \mid x_1 \ldots x_t) = (\nu_{x_1 \ldots x_t}(a) + 1)/(t + |A|), \tag{1} \]

where $\nu_{x_1 \ldots x_t}(a)$ denotes the number of occurrences of the
letter $a$ in the word $x_1 \ldots x_{t-1} x_t$. It is important to note
that the predicted probabilities cannot be equal to zero even though a
certain letter did not occur in the word $x_1 \ldots x_{t-1} x_t$.

Example. Let $A = \{0, 1\}$ and $x_1 \ldots x_5 = 01010$. Then the
Laplace prediction is as follows: $L_0(x_6 = 0 \mid x_1 \ldots x_5 =
01010) = (3 + 1)/(5 + 2) = 4/7$ and $L_0(x_6 = 1 \mid x_1 \ldots x_5 =
01010) = (2 + 1)/(5 + 2) = 3/7$. In other words, 4/7 and 3/7 are
estimations of the unknown probabilities $P(x_{t+1} = 0 \mid x_1 \ldots
x_t = 01010)$ and $P(x_{t+1} = 1 \mid x_1 \ldots x_t = 01010)$. (In what
follows we will use the shorter notation $P(0 \mid 01010)$ and
$P(1 \mid 01010)$.)
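
The Laplace predictor (1) is easy to try out. The following minimal
sketch (the function name and the use of exact fractions are our own
choices for illustration, not part of the original text) reproduces the
example above.

```python
from collections import Counter
from fractions import Fraction

def laplace_predict(history, alphabet):
    """Return L_0(a | history) for every letter a, as exact fractions (eq. (1))."""
    counts = Counter(history)
    t = len(history)
    return {a: Fraction(counts[a] + 1, t + len(alphabet)) for a in alphabet}

pred = laplace_predict("01010", "01")
print(pred["0"], pred["1"])  # 4/7 3/7
```

Note that a letter which never occurred still receives the positive
probability 1/(t + |A|).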

We can see that Laplace considered prediction as a set of estimations
of unknown (conditional) probabilities. This approach to the problem of
prediction was developed in 1988 [35] and is now often called on-line
prediction or universal prediction [nil9 t[27ll28j. It seems natural to
consider conditional probabilities to be the best prediction, because
they contain all information about the future behavior of the
stochastic process. Moreover, this approach is deeply connected with
the game-theoretical interpretation of prediction p^l37j and, in fact,
all obtained results can be easily transferred from one model to the
other.

Any predictor $\gamma$ defines a measure (or an estimation of
probability) by the following equation:

\[ \gamma(x_1 \ldots x_t) = \prod_{i=1}^{t} \gamma(x_i \mid x_1 \ldots x_{i-1}). \tag{2} \]

And, vice versa, any measure $\gamma$ (or estimation of probability)
defines a predictor:

\[ \gamma(x_i \mid x_1 \ldots x_{i-1}) = \gamma(x_1 \ldots x_{i-1} x_i) / \gamma(x_1 \ldots x_{i-1}). \tag{3} \]

Example. Let us apply the Laplace predictor for estimation of the
probabilities of the sequences 01010 and 010101. From (2) we obtain
$L_0(01010) = \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{2}{4} \cdot
\frac{2}{5} \cdot \frac{3}{6} = \frac{1}{60}$ and $L_0(010101) =
\frac{1}{60} \cdot \frac{3}{7} = \frac{1}{140}$. Vice versa, if for some
measure (or probability estimation) $\chi$ we have $\chi(01010) =
\frac{1}{60}$ and $\chi(010101) = \frac{1}{140}$, then we obtain from (3)
the following prediction, or estimation of the conditional probability:
$\chi(1 \mid 01010) = \frac{1/140}{1/60} = \frac{3}{7}$.
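
The passage between predictors and measures in (2) and (3) can be
sketched in a few lines. The helper names below are hypothetical, and
the Laplace predictor is used only as a convenient concrete case.

```python
from collections import Counter
from fractions import Fraction

def laplace(a, history, alphabet="01"):
    """Laplace prediction L_0(a | history)."""
    return Fraction(Counter(history)[a] + 1, len(history) + len(alphabet))

def measure(word, predictor):
    """Equation (2): multiply the conditional predictions letter by letter."""
    prob = Fraction(1)
    for i, letter in enumerate(word):
        prob *= predictor(letter, word[:i])
    return prob

def conditional(a, history, meas):
    """Equation (3): recover a prediction from a measure on words."""
    return meas(history + a) / meas(history)

def laplace_measure(word):
    return measure(word, laplace)

print(laplace_measure("01010"))                     # 1/60
print(laplace_measure("010101"))                    # 1/140
print(conditional("1", "01010", laplace_measure))   # 3/7
```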

Now we specify the class of stochastic processes which will be
considered. Generally speaking, we will deal with so-called stationary
and ergodic time series (or sources), whose definition will be given
later, but now we consider perhaps the simplest class of such
processes, which are called i.i.d. sources. By definition, they
generate independent and identically distributed random variables from
some set $A$. In our case $A$ will be either some alphabet or a
real-valued interval.

The next natural question is how to measure the errors of prediction
and estimation of probability. Mainly we will measure these errors by
the Kullback-Leibler (KL) divergence, which is defined by

\[ D(P, Q) = \sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)}, \tag{4} \]

where $P(a)$ and $Q(a)$ are probability distributions over an alphabet
$A$ (here and below $\log \equiv \log_2$ and $0 \log 0 = 0$). The
probability distribution $P(a)$ can be considered as unknown whereas
$Q(a)$ is its estimation. It is well known that for any distributions
$P$ and $Q$ the KL divergence is nonnegative and equals 0 if and only if
$P(a) = Q(a)$ for all $a$ [H]. So, if the estimation $Q$ is equal to
$P$, the error is 0; otherwise the error is a positive number.

The KL divergence is connected with the so-called variation distance

\[ \|P - Q\| = \sum_{a \in A} |P(a) - Q(a)| \]

via the following inequality (Pinsker's inequality):

\[ D(P, Q) \ge \frac{\log e}{2} \, \|P - Q\|^2. \tag{5} \]
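
As a numerical illustration, the sketch below computes the KL
divergence (4) and the variation distance for two distributions and
checks Pinsker's inequality, taken here in its usual form
$D(P, Q) \ge (\log e / 2) \|P - Q\|^2$ with logarithms to base 2. The
function names and the choice of $P$ and $Q$ are assumptions made for
illustration.

```python
import math

def kl_divergence(p, q):
    """D(P, Q) in bits, with the convention 0 log 0 = 0 (eq. (4))."""
    return sum(pa * math.log2(pa / q[a]) for a, pa in p.items() if pa > 0)

def variation_distance(p, q):
    """||P - Q|| = sum over a of |P(a) - Q(a)|."""
    return sum(abs(p[a] - q[a]) for a in p)

P = {"0": 0.5, "1": 0.5}        # hypothetical true distribution
Q = {"0": 4 / 7, "1": 3 / 7}    # an estimate of it
d = kl_divergence(P, Q)
v = variation_distance(P, Q)
print(d, v, d >= math.log2(math.e) / 2 * v ** 2)
```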

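Let $\gamma$ be a predictor, i.e. an estimation of an unknown
conditional probability, and let $x_1 \ldots x_t$ be a sequence of
letters created by an unknown source $P$. The KL divergence between $P$
and the predictor $\gamma$ is equal to

\[ \rho_{\gamma,P}(x_1 \ldots x_t) = \sum_{a \in A} P(a \mid x_1 \ldots x_t) \log \frac{P(a \mid x_1 \ldots x_t)}{\gamma(a \mid x_1 \ldots x_t)}. \tag{6} \]

For fixed $t$ it is a random variable, because $x_1, x_2, \ldots, x_t$
are random variables. We define the average error at time $t$ by

\[ \rho^t(P \| \gamma) = E\big(\rho_{\gamma,P}(\cdot)\big) = \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \, \rho_{\gamma,P}(x_1 \ldots x_t) \tag{7} \]
\[ = \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \sum_{a \in A} P(a \mid x_1 \ldots x_t) \log \frac{P(a \mid x_1 \ldots x_t)}{\gamma(a \mid x_1 \ldots x_t)}. \]

Analogously, if $\gamma(\cdot)$ is an estimation of a probability
distribution, we define the errors per letter as follows:

\[ \bar\rho_{\gamma,P}(x_1 \ldots x_t) = t^{-1} \log \big( P(x_1 \ldots x_t) / \gamma(x_1 \ldots x_t) \big) \tag{8} \]

and

\[ \bar\rho^t(P \| \gamma) = t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log \big( P(x_1 \ldots x_t) / \gamma(x_1 \ldots x_t) \big), \tag{9} \]

where, as before, $\gamma(x_1 \ldots x_t) = \prod_{i=1}^{t} \gamma(x_i
\mid x_1 \ldots x_{i-1})$. (Here and below we denote by $A^t$ and $A^*$
the set of all words of length $t$ over $A$ and the set of all finite
words over $A$, correspondingly: $A^* = \bigcup_{i=1}^{\infty} A^i$.)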

Claim 1 ([35]). For any i.i.d. source $P$ generating letters from an
alphabet $A$ and any integer $t$, the average error (7) of the Laplace
predictor and the average error (9) of the Laplace estimator are upper
bounded as follows:

\[ \rho^t(P \| L_0) \le ((|A| - 1) \log e)/(t + 1), \tag{10} \]

\[ \bar\rho^t(P \| L_0) \le (|A| - 1) \log t / t + O(1/t), \tag{11} \]

where $e \approx 2.718$ is the Euler number.
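
The bound (10) can be checked numerically in a simple case. The sketch
below computes the average error of the Laplace predictor exactly for a
binary i.i.d. source by summing over the number of ones in the sample.
The source parameter 0.3 and the function name are illustrative
assumptions.

```python
import math

def avg_laplace_error(p, t):
    """Exact average error (in bits) of the Laplace predictor after t letters
    of a binary i.i.d. source with P(1) = p."""
    total = 0.0
    for k in range(t + 1):                       # k = number of ones among x_1...x_t
        weight = math.comb(t, k) * p**k * (1 - p)**(t - k)
        g1 = (k + 1) / (t + 2)                   # Laplace prediction of a one
        kl = 0.0
        for prob, est in ((p, g1), (1 - p, 1 - g1)):
            if prob > 0:
                kl += prob * math.log2(prob / est)
        total += weight * kl
    return total

for t in (1, 10, 100):
    bound = math.log2(math.e) / (t + 1)          # (|A| - 1) log e / (t + 1) with |A| = 2
    print(t, avg_laplace_error(0.3, t), bound)
```

For each listed $t$ the computed error should stay below the bound and
decrease as $t$ grows.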

So, we can see that the average error of the Laplace predictor goes to
zero for any i.i.d. source $P$ when the length $t$ of the sample
$x_1 \ldots x_t$ tends to infinity. Such methods are called universal,
because the error goes to zero for any source, or process. In this case
they are universal for the set of all i.i.d. sources generating letters
from the finite alphabet $A$, but later we consider universal
estimators for the set of stationary and ergodic sources. It is worth
noting that the first universal code for which the estimation (11) is
valid was suggested independently by Fitingof [13] and Kolmogorov [21]
in 1966.

The value

\[ \bar\rho^t(P \| \gamma) = t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log \big( P(x_1 \ldots x_t) / \gamma(x_1 \ldots x_t) \big) \]

has one more interpretation connected with data compression. Now we
consider the main idea, whereas the more formal definitions will be
given later. First we recall the definition of the Shannon entropy
$h_0(P)$ for an i.i.d. source $P$:

\[ h_0(P) = - \sum_{a \in A} P(a) \log P(a). \tag{12} \]

It is easy to see that $t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1
\ldots x_t) \log P(x_1 \ldots x_t) = -h_0(P)$ for the i.i.d. source.
Hence, we can represent the average error $\bar\rho^t(P \| \gamma)$ in
(9) as

\[ \bar\rho^t(P \| \gamma) = t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log \big( 1 / \gamma(x_1 \ldots x_t) \big) - h_0(P). \]

More formal and general consideration of universal codes will be given
later, but here we briefly show how estimations and codes are
connected. The point is that one can construct a code with codelength
$|\gamma_{code}(a \mid x_1 \ldots x_t)| \approx -\log_2 \gamma(a \mid x_1
\ldots x_t)$ for any letter $a \in A$ (since Shannon's original
research, it has been well known that, using block codes with large
block length or more modern methods of arithmetic coding [31], the
approximation may be made as accurate as desired). If one knows the
real distribution $P$, one can base coding on the true distribution $P$
and not on the prediction $\gamma$. The difference in performance
measured by average code length is given by

\[ \sum_{a \in A} P(a \mid x_1 \ldots x_t) \big( -\log_2 \gamma(a \mid x_1 \ldots x_t) \big) - \sum_{a \in A} P(a \mid x_1 \ldots x_t) \big( -\log_2 P(a \mid x_1 \ldots x_t) \big) \]
\[ = \sum_{a \in A} P(a \mid x_1 \ldots x_t) \log_2 \frac{P(a \mid x_1 \ldots x_t)}{\gamma(a \mid x_1 \ldots x_t)}. \]

Thus this excess is exactly the error (6) defined above. Analogously,
if we encode the sequence $x_1 \ldots x_t$ based on a predictor
$\gamma$, the redundancy per letter is defined by (8) and (9). So, from
the mathematical point of view, the estimation of the limiting
probabilities and universal coding are identical. But $-\log \gamma(x_1
\ldots x_t)$ and $-\log P(x_1 \ldots x_t)$ have a very natural
interpretation. The first value is the codeword length (in bits) if the
"code" $\gamma$ is applied for compressing the word $x_1 \ldots x_t$,
and the second one is the minimal possible codeword length. The
difference is the redundancy of the code and, at the same time, the
error of the predictor. It is worth noting that there are many other
deep interrelations between universal coding, prediction and
estimation [32, 35].
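
To make the codelength interpretation concrete, the following sketch
compares the ideal codelength $-\log_2 \gamma(x_1 \ldots x_t)$ obtained
from the Laplace measure with the minimal length $-\log_2 P(x_1 \ldots
x_t)$. The true source (a fair coin) is a hypothetical choice for
illustration; in practice $P$ is unknown.

```python
import math
from collections import Counter

def laplace_codelength(word, alphabet="01"):
    """Ideal code length -log2 L_0(word) in bits, accumulated letter by letter."""
    bits, counts = 0.0, Counter()
    for i, letter in enumerate(word):
        bits -= math.log2((counts[letter] + 1) / (i + len(alphabet)))
        counts[letter] += 1
    return bits

def source_codelength(word, probs):
    """Minimal possible length -log2 P(word) for an i.i.d. source."""
    return -sum(math.log2(probs[letter]) for letter in word)

word = "01010"
probs = {"0": 0.5, "1": 0.5}    # hypothetical true source
redundancy = laplace_codelength(word) - source_codelength(word, probs)
print(redundancy / len(word))   # per-letter error (8) for this word
```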

We can see from the claim and the Pinsker inequality (5) that the
variation distance of the Laplace predictor and estimator goes to zero,
too. Moreover, it can be easily shown that the error (6) (and the
corresponding variation distance) goes to zero with probability 1 when
$t$ goes to infinity. (Informally, it means that the error (6) goes to
zero for almost all sequences $x_1 \ldots x_t$ according to the measure
$P$.) Obviously, such properties are very desirable for any predictor
and for larger classes of sources, like Markov and stationary ergodic
sources (they will be briefly defined in the next subsection). However,
it is proven that such predictors do not exist for the class of all
stationary and ergodic sources (generating letters from a given finite
alphabet). More precisely, if, for example, the alphabet has two
letters, then for any predictor $\gamma$ and for any $\delta > 0$ there
exists a source $P$ such that with probability 1
$\rho_{\gamma,P}(x_1 \ldots x_t) \ge 1/2 - \delta$ infinitely often when
$t \to \infty$. In other words, the error of any predictor may not go to
0 if the predictor is applied to an arbitrary stationary and ergodic
source; that is why it is difficult to use (6) and (7) to compare
different predictors. On the other hand, it is shown [35] that there
exists a predictor $R$ such that the Cesaro average $t^{-1}
\sum_{i=1}^{t} \rho_{R,P}(x_1 \ldots x_i)$ goes to 0 (with probability
1) for any stationary and ergodic source $P$ when $t$ goes to infinity.
(This predictor will be described in the next subsection.) That is why
we will focus our attention on such averages. From the definitions (6),
(7) and properties of the logarithm we can see that for any probability
distribution $\gamma$

\[ t^{-1} \sum_{i=1}^{t} \rho_{\gamma,P}(x_1 \ldots x_i) = t^{-1} \log \big( P(x_1 \ldots x_t) / \gamma(x_1 \ldots x_t) \big), \]

\[ t^{-1} \sum_{i=1}^{t} \rho^i(P \| \gamma) = t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log \big( P(x_1 \ldots x_t) / \gamma(x_1 \ldots x_t) \big). \]

Taking into account these equations, we can see from the definitions
(8) and (9) that the Cesaro averages of the prediction errors (6) and
(7) are equal to the errors of estimation of limiting probabilities (8)
and (9). That is why we will use the values (8) and (9) as the main
measures of precision throughout the chapter.

A natural problem is to find a predictor and an estimator of the
limiting probabilities whose average error (9) is minimal for the set
of i.i.d. sources. This problem was considered and solved by Krichevsky
[23, 24]. He suggested the following predictor:

\[ K_0(a \mid x_1 \ldots x_t) = (\nu_{x_1 \ldots x_t}(a) + 1/2)/(t + |A|/2), \tag{13} \]

where, as before, $\nu_{x_1 \ldots x_t}(a)$ is the number of occurrences
of the letter $a$ in the word $x_1 \ldots x_t$. We can see that the
Krichevsky predictor is quite close to Laplace's one ([35]).

Example. Let $A = \{0, 1\}$ and $x_1 \ldots x_5 = 01010$. Then
$K_0(x_6 = 0 \mid 01010) = (3 + 1/2)/(5 + 1) = 7/12$, $K_0(x_6 = 1 \mid
01010) = (2 + 1/2)/(5 + 1) = 5/12$ and $K_0(01010) = \frac{1}{2} \cdot
\frac{1}{4} \cdot \frac{1}{2} \cdot \frac{3}{8} \cdot \frac{1}{2} =
\frac{3}{256}$.
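
The Krichevsky predictor (13) differs from the Laplace predictor only
in the added constants. A minimal sketch (function names are our own)
evaluates it on the word 01010 using exact fractions.

```python
from collections import Counter
from fractions import Fraction

def krichevsky_predict(a, history, alphabet="01"):
    """K_0(a | history) = (count + 1/2) / (t + |A|/2), as an exact fraction."""
    count = Counter(history)[a]
    return Fraction(2 * count + 1, 2 * len(history) + len(alphabet))

def krichevsky_measure(word, alphabet="01"):
    """K_0(word) as the product of the conditional predictions."""
    prob = Fraction(1)
    for i, letter in enumerate(word):
        prob *= krichevsky_predict(letter, word[:i], alphabet)
    return prob

print(krichevsky_predict("0", "01010"))  # 7/12
print(krichevsky_predict("1", "01010"))  # 5/12
print(krichevsky_measure("01010"))       # 3/256
```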

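The Krichevsky measure $K_0$ can be represented as follows:

\[ K_0(x_1 \ldots x_t) = \prod_{i=1}^{t} \frac{\nu_{x_1 \ldots x_{i-1}}(x_i) + 1/2}{i - 1 + |A|/2} = \frac{\prod_{a \in A} \left( \prod_{j=1}^{\nu_{x_1 \ldots x_t}(a)} (j - 1/2) \right)}{\prod_{i=0}^{t-1} (i + |A|/2)}. \tag{14} \]

It is known that

\[ (r + 1/2)((r + 1) + 1/2) \cdots (s - 1/2) = \frac{\Gamma(s + 1/2)}{\Gamma(r + 1/2)}, \tag{15} \]

where $\Gamma(\cdot)$ is the gamma function [22]. So, (14) can be
presented as follows:

\[ K_0(x_1 \ldots x_t) = \frac{\prod_{a \in A} \left( \Gamma(\nu_{x_1 \ldots x_t}(a) + 1/2) / \Gamma(1/2) \right)}{\Gamma(t + |A|/2) / \Gamma(|A|/2)}. \tag{16} \]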

The following claim shows that the error of the Krichevsky estimator is
half that of the Laplace one.

Claim 2. For any i.i.d. source $P$ generating letters from a finite
alphabet $A$, the average error (9) of the estimator $K_0$ is upper
bounded as follows:

\[ \bar\rho^t(P \| K_0) = t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log \big( P(x_1 \ldots x_t) / K_0(x_1 \ldots x_t) \big) \]
\[ = t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log \big( 1 / K_0(x_1 \ldots x_t) \big) - h_0(P) \le ((|A| - 1) \log t + C)/(2t), \tag{17} \]

where $C$ is a constant.

Moreover, in a certain sense this average error is minimal: it is shown
by Krichevsky [23] that for any predictor $\gamma$ there exists a
source $P^*$ such that

\[ \bar\rho^t(P^* \| \gamma) \ge ((|A| - 1) \log t + C')/(2t). \]

Hence, the bound $((|A| - 1) \log t + C)/(2t)$ cannot be reduced and
the Krichevsky estimator is the best (up to $O(1/t)$) if the error is
measured by the KL divergence $\rho$.
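
Returning to the gamma-function form (16) of the Krichevsky measure: it
lends itself to computation in the log domain, which avoids underflow
for long words. The sketch below is one possible way to do this with
the standard `math.lgamma` function; the function name is a
hypothetical choice.

```python
import math
from collections import Counter

def krichevsky_log2_gamma(word, alphabet="01"):
    """log2 K_0(word) computed from the gamma-function form (16)."""
    counts = Counter(word)
    t, size = len(word), len(alphabet)
    log_num = sum(math.lgamma(counts[a] + 0.5) - math.lgamma(0.5) for a in alphabet)
    log_den = math.lgamma(t + size / 2) - math.lgamma(size / 2)
    return (log_num - log_den) / math.log(2)

print(2 ** krichevsky_log2_gamma("01010"), 3 / 256)
```

The two printed values should agree up to floating-point rounding,
since the letter-by-letter product for the word 01010 equals 3/256.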

2.2 Consistent Estimations and On-line Predictors for Markov
and Stationary Ergodic Processes

Now we briefly describe consistent estimations of unknown probabilities
and efficient on-line predictors for general stochastic processes (or
sources of information).

First we give a formal definition of stationary ergodic processes. The
time shift $T$ on $A^\infty$ is defined as $T(x_1, x_2, x_3, \ldots) =
(x_2, x_3, \ldots)$. A process $P$ is called stationary if it is
$T$-invariant: $P(T^{-1} B) = P(B)$ for every Borel set $B \subset
A^\infty$. A stationary process is called ergodic if every
$T$-invariant set has probability 0 or 1: $P(B) = 0$ or $1$ whenever
$T^{-1} B = B$ [5, 14].

We denote by $M_\infty(A)$ the set of all stationary and ergodic
sources and let $M_0(A) \subset M_\infty(A)$ be the set of all i.i.d.
processes. We denote by $M_m(A) \subset M_\infty(A)$ the set of Markov
sources of order (or with memory, or connectivity) not larger than $m$,
$m \ge 0$. By definition $\mu \in M_m(A)$ if

\[ \mu(x_{t+1} = a_{i_1} \mid x_t = a_{i_2}, x_{t-1} = a_{i_3}, \ldots, x_{t-m+1} = a_{i_{m+1}}, \ldots) \tag{18} \]
\[ = \mu(x_{t+1} = a_{i_1} \mid x_t = a_{i_2}, x_{t-1} = a_{i_3}, \ldots, x_{t-m+1} = a_{i_{m+1}}) \]

for all $t \ge m$ and $a_{i_1}, a_{i_2}, \ldots \in A$. Let $M^*(A) =
\bigcup_{i=0}^{\infty} M_i(A)$ be the set of all finite-order sources.

The Laplace and Krichevsky predictors can be extended to general Markov
processes. The trick is to view a Markov source $\mu \in M_m(A)$ as
resulting from $|A|^m$ i.i.d. sources. We illustrate this idea by an
example [S]. Assume that $A = \{0, 1\}$, $m = 2$ and that the source
$\mu \in M_2(A)$ has generated the sequence

00101100111010.

We represent this sequence by the following four subsequences:

**1*****1*****,
***0*1***1***0,
****1**0****1*,
******0***10**.

These four subsequences contain letters which follow 00, 01, 10 and 11,
respectively. By definition, $\mu \in M_m(A)$ if $\mu(a \mid x_t \ldots
x_1) = \mu(a \mid x_t \ldots x_{t-m+1})$ for all $0 < m \le t$, all $a
\in A$ and all $x_1 \ldots x_t \in A^t$. Therefore, each of the four
generated subsequences may be considered to be generated by an i.i.d.
source. Further, it is possible to reconstruct the original sequence if
we know the four ($= |A|^m$) subsequences and the two ($= m$) first
letters of the original sequence.

Any predictor $\gamma$ for i.i.d. sources can be applied to Markov
sources. Indeed, in order to predict, it is enough to store in memory
$|A|^m$ sequences, one corresponding to each word in $A^m$. Thus, in the
example, the letter $x_3$ which follows 00 is predicted based on the
i.i.d. method $\gamma$ corresponding to the $x_1 x_2$-subsequence
(= 00), then $x_4$ is predicted based on the i.i.d. method
corresponding to $x_2 x_3$, i.e. to the 01-subsequence, and so forth.
When this scheme is applied along with either $L_0$ or $K_0$ we denote
the obtained predictors as $L_m$ and $K_m$, correspondingly, and define
the probabilities for the first $m$ letters as follows: $L_m(x_1) =
L_m(x_2) = \ldots = L_m(x_m) = 1/|A|$, $K_m(x_1) = K_m(x_2) = \ldots =
K_m(x_m) = 1/|A|$. For example, having taken into account (16), we can
present the Krichevsky predictors for $M_m(A)$ as follows:

\[ K_m(x_1 \ldots x_t) = \begin{cases} \dfrac{1}{|A|^t}, & \text{if } t \le m, \\[2ex] \dfrac{1}{|A|^m} \displaystyle\prod_{v \in A^m} \dfrac{\prod_{a \in A} \left( \Gamma(\nu_x(va) + 1/2) / \Gamma(1/2) \right)}{\Gamma(\bar\nu_x(v) + |A|/2) / \Gamma(|A|/2)}, & \text{if } t > m, \end{cases} \tag{19} \]

where $\bar\nu_x(v) = \sum_{a \in A} \nu_x(va)$, $x = x_1 \ldots x_t$. It
is worth noting that the representation (14) can be more convenient for
carrying out calculations if $t$ is small.

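Example. For the word 00101100111010 considered in the previous
example, we obtain

\[ K_2(00101100111010) = 2^{-2} \left( \tfrac{1}{2} \cdot \tfrac{3}{4} \right) \left( \tfrac{1}{2} \cdot \tfrac{1}{4} \cdot \tfrac{1}{2} \cdot \tfrac{3}{8} \right) \left( \tfrac{1}{2} \cdot \tfrac{1}{4} \cdot \tfrac{1}{2} \right) \left( \tfrac{1}{2} \cdot \tfrac{1}{4} \cdot \tfrac{1}{2} \right). \]

Here groups of multipliers correspond to the subsequences 11, 0110,
101, 010.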
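
The construction of $K_m$ from $|A|^m$ independent copies of $K_0$ can
be sketched as follows. The code keeps one table of letter counts per
context of length $m$ and assigns probability $1/|A|$ to each of the
first $m$ letters, as described above; the function name and data
layout are our own assumptions.

```python
from collections import defaultdict
from fractions import Fraction

def krichevsky_markov(word, m, alphabet="01"):
    """K_m(word): one Krichevsky predictor K_0 per context of length m."""
    counts = defaultdict(lambda: defaultdict(int))   # context -> letter -> count
    prob = Fraction(1)
    for i, letter in enumerate(word):
        if i < m:
            prob *= Fraction(1, len(alphabet))       # first m letters are uniform
            continue
        ctx = word[i - m:i]
        seen = sum(counts[ctx].values())
        prob *= Fraction(2 * counts[ctx][letter] + 1, 2 * seen + len(alphabet))
        counts[ctx][letter] += 1
    return prob

print(krichevsky_markov("00101100111010", 2))  # 9/1048576
```

The printed value is the product of the four groups of multipliers in
the example, times $2^{-2}$.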

In order to estimate the error of the Krichevsky predictor we need a
general definition of the Shannon entropy. Let $P$ be a stationary and
ergodic source generating letters from a finite alphabet $A$. The
$m$-order (conditional) Shannon entropy and the limiting Shannon
entropy are defined as follows:

\[ h_m(P) = - \sum_{v \in A^m} P(v) \sum_{a \in A} P(a \mid v) \log P(a \mid v), \qquad h_\infty(P) = \lim_{m \to \infty} h_m(P). \tag{20} \]

(If $m = 0$ we obtain the definition (12).) It is also known that for
any $m$

\[ h_\infty(P) \le h_m(P) \tag{21} \]

mm-

Claim 3. For any stationary and ergodic source $P$ generating letters
from a finite alphabet $A$, the average error of the Krichevsky
predictor $K_m$ is upper bounded as follows:

\[ -t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log K_m(x_1 \ldots x_t) - h_m(P) \le \frac{|A|^m (|A| - 1) \log t + C}{2t}, \tag{22} \]

where $C$ is a constant.

The following so-called empirical Shannon entropy, which is an
estimation of the entropy (20), will play a key role in the hypothesis
testing. It will be convenient to consider its definition here, because
this notation will be used in the proof of the next claims. Let $v =
v_1 \ldots v_k$ and $x = x_1 x_2 \ldots x_t$ be words from $A^*$. Denote
the number of occurrences of the word $v$ in the sequence of words $x_1
x_2 \ldots x_k$, $x_2 x_3 \ldots x_{k+1}$, $x_3 x_4 \ldots x_{k+2}$,
$\ldots$, $x_{t-k+1} \ldots x_t$ as $\nu_x(v)$. For example, if $x =
000100$ and $v = 00$, then $\nu_x(00) = 3$. For any $0 \le k < t$ the
empirical Shannon entropy of order $k$ is defined as follows:

\[ h^*_k(x) = - \sum_{v \in A^k} \frac{\bar\nu_x(v)}{t - k} \sum_{a \in A} \frac{\nu_x(va)}{\bar\nu_x(v)} \log \frac{\nu_x(va)}{\bar\nu_x(v)}, \tag{23} \]

where $x = x_1 \ldots x_t$ and $\bar\nu_x(v) = \sum_{a \in A}
\nu_x(va)$. In particular, if $k = 0$, we obtain $h^*_0(x) = -t^{-1}
\sum_{a \in A} \nu_x(a) \log(\nu_x(a)/t)$.
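
A direct computation of the empirical entropy may help to fix the
notation. The sketch below assumes the order-$k$ formula in the form
$h^*_k(x) = -(t-k)^{-1} \sum_{v} \sum_{a} \nu_x(va) \log(\nu_x(va) /
\bar\nu_x(v))$ with $\bar\nu_x(v) = \sum_a \nu_x(va)$, which reduces to
the stated expression for $k = 0$; the function name is hypothetical.

```python
import math
from collections import Counter

def empirical_entropy(x, k):
    """Empirical Shannon entropy of order k (in bits) of the word x, 0 <= k < len(x)."""
    t = len(x)
    pair = Counter(x[i:i + k + 1] for i in range(t - k))   # nu_x(va)
    ctx = Counter()                                         # bar nu_x(v)
    for va, n in pair.items():
        ctx[va[:k]] += n
    total = sum(n * math.log2(n / ctx[va[:k]]) for va, n in pair.items())
    return -total / (t - k)

print(empirical_entropy("000100", 0))
print(empirical_entropy("010101", 1))
```

For the alternating word 010101 every context of length 1 determines
the next letter, so the order-1 value is zero.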

Let us define the measure $R$, which, in fact, is a consistent
estimator of probabilities for the class of all stationary and ergodic
processes with a finite alphabet. First we define a probability
distribution $\{\omega = \omega_1, \omega_2, \ldots\}$ on the integers
$\{1, 2, \ldots\}$ by

\[ \omega_1 = 1 - 1/\log 3, \; \ldots, \; \omega_i = 1/\log(i + 1) - 1/\log(i + 2), \; \ldots. \tag{24} \]

(In what follows we will use this distribution, but the results
described below are obviously true for any distribution with nonzero
probabilities.) The measure $R$ is defined as follows:

\[ R(x_1 \ldots x_t) = \sum_{i=0}^{\infty} \omega_{i+1} K_i(x_1 \ldots x_t). \tag{25} \]

It is worth noting that this construction can be applied to the Laplace
measure (if we use $L_i$ instead of $K_i$) and any other family of
measures.

Example. Let us calculate $R(00)$, $R(11)$. From the definitions of
$K_0$ and $K_i$ we obtain:

\[ K_0(00) = K_0(11) = \tfrac{1}{2} \cdot \tfrac{3}{4} = 3/8, \qquad K_0(01) = K_0(10) = \tfrac{1}{2} \cdot \tfrac{1}{4} = 1/8, \]

\[ K_i(00) = K_i(01) = K_i(10) = K_i(11) = 1/4, \qquad i \ge 1. \]

Having taken into account the definitions of $\omega_i$ (24) and the
measure $R$ (25), we can calculate $R(z_1 z_2)$ as follows:

\[ R(00) = \omega_1 K_0(00) + \omega_2 K_1(00) + \ldots = (1 - 1/\log 3) \cdot 3/8 + (1/\log 3 - 1/\log 4) \cdot 1/4 + (1/\log 4 - 1/\log 5) \cdot 1/4 + \ldots \]
\[ = (1 - 1/\log 3) \cdot 3/8 + (1/\log 3) \cdot 1/4 \approx 0.296. \]

Analogously, $R(01) = R(10) \approx 0.204$ and $R(11) \approx 0.296$.
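
The infinite sum (25) can be evaluated exactly for a word of length
$t$, because $K_i(x_1 \ldots x_t) = |A|^{-t}$ for all $i \ge t$ and the
corresponding weights telescope to $1/\log(t + 2)$. The sketch below
uses this observation; the function names are our own, and floating
point is used instead of exact arithmetic.

```python
import math
from collections import defaultdict

def krichevsky_markov(word, m, alphabet="01"):
    """K_m(word), with the first m letters assigned probability 1/|A| each."""
    counts = defaultdict(lambda: defaultdict(int))
    prob = 1.0
    for i, letter in enumerate(word):
        if i < m:
            prob /= len(alphabet)
            continue
        ctx = word[i - m:i]
        seen = sum(counts[ctx].values())
        prob *= (counts[ctx][letter] + 0.5) / (seen + len(alphabet) / 2)
        counts[ctx][letter] += 1
    return prob

def omega(i):
    """Weight omega_i from (24), i = 1, 2, ..."""
    return 1 / math.log2(i + 1) - 1 / math.log2(i + 2)

def measure_r(word, alphabet="01"):
    """R(word) from (25), with the tail i >= t summed in closed form."""
    t = len(word)
    head = sum(omega(i + 1) * krichevsky_markov(word, i, alphabet) for i in range(t))
    tail = len(alphabet) ** (-t) / math.log2(t + 2)
    return head + tail

for w in ("00", "01", "10", "11"):
    print(w, round(measure_r(w), 3))
```

The four values should match the example (about 0.296 and 0.204) and
sum to 1.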

The main properties of the measure $R$ are connected with the Shannon
entropy (20).

Theorem 1 ([35]). For any stationary and ergodic source $P$ the
following equalities are valid:

\[ i) \quad \lim_{t \to \infty} \frac{1}{t} \log \big( 1 / R(x_1 \ldots x_t) \big) = h_\infty(P) \]

with probability 1,

\[ ii) \quad \lim_{t \to \infty} \frac{1}{t} \sum_{u \in A^t} P(u) \log \big( 1 / R(u) \big) = h_\infty(P). \]

So, if one uses the measure $R$ for data compression in such a way that
the codeword length of the sequence $x_1 \ldots x_t$ is (approximately)
equal to $\log(1/R(x_1 \ldots x_t))$ bits, one obtains the best
achievable data compression ratio $h_\infty(P)$ per letter. On the
other hand, we know that the redundancy of a universal code and the
error of the corresponding predictor are equal. Hence, if one uses the
measure $R$ for estimation and/or prediction, the error (per letter)
will go to zero.

2.3 Hypothesis Testing 

Here we briefly describe the main notions of hypothesis testing and the
two particular problems considered below. A statistical test is
formulated to test a specific null hypothesis ($H_0$). Associated with
this null hypothesis is the alternative hypothesis ($H_1$) [33]. For
example, we will consider the two following problems: goodness-of-fit
testing (or identity testing) and testing of serial independence. Both
problems are well known in mathematical statistics and there is an
extensive literature dealing with their nonparametric testing
[HElEllIl].

The goodness-of-fit testing is described as follows: the hypothesis
$H_0^{id}$ is that the source has a particular distribution $\pi$ and
the alternative hypothesis $H_1^{id}$ is that the sequence is generated
by a stationary and ergodic source which differs from the source under
$H_0^{id}$. One particular case, mentioned in the Introduction, is when
the source alphabet $A$ is $\{0, 1\}$ and the main hypothesis
$H_0^{id}$ is that a bit sequence is generated by the Bernoulli i.i.d.
source with equal probabilities of 0's and 1's. In all cases, the
testing should be based on a sample $x_1 \ldots x_t$ generated by the
source.

The second problem is as follows: the null hypothesis $H_0^{SI}$ is
that the source is Markovian of order not larger than $m$ ($m \ge 0$),
and the alternative hypothesis $H_1^{SI}$ is that the sequence is
generated by a stationary and ergodic source which differs from the
source under $H_0^{SI}$. In particular, if $m = 0$, this is the problem
of testing for independence of time series.

For each applied test, a decision is derived that accepts or rejects
the null hypothesis. During the test, a test statistic value is
computed on the data (the sequence being tested). This test statistic
value is compared to the critical value. If the test statistic value
exceeds the critical value, the null hypothesis is rejected. Otherwise,
the null hypothesis is accepted. So, statistical hypothesis testing is
a conclusion-generation procedure that has two possible outcomes:
either accept $H_0$ or accept $H_1$.

Errors of the two following types are possible: the Type I error occurs
if $H_0$ is true but the test accepts $H_1$ and, vice versa, the Type II
error occurs if $H_1$ is true but the test accepts $H_0$. The
probability of a Type I error is often called the level of significance
of the test. This probability can be set prior to the testing and is
denoted $\alpha$. For a test, $\alpha$ is the probability that the test
will say that $H_0$ is not true when it really is true. Common values of
$\alpha$ are about 0.01. The probabilities of Type I and Type II errors
are related to each other and to the size $n$ of the tested sequence in
such a way that if two of them are specified, the third value is
automatically determined. Practitioners usually select a sample size
$n$ and a value for the probability of the Type I error, the level of
significance [33].

2.4 Codes 

We briefly describe the main definitions and properties (without
proofs) of lossless codes, or methods of (lossless) data compression. A
data compression method (or code) $\varphi$ is defined as a set of
mappings $\varphi_n$ such that $\varphi_n : A^n \to \{0, 1\}^*$, $n = 1,
2, \ldots$, and for each pair of different words $x, y \in A^n$
$\varphi_n(x) \ne \varphi_n(y)$. It is also required that each sequence
$\varphi_n(u_1) \varphi_n(u_2) \ldots \varphi_n(u_r)$, $r \ge 1$, of
encoded words from the set $A^n$, $n \ge 1$, can be uniquely decoded
into $u_1 u_2 \ldots u_r$. Such codes are called uniquely decodable. For
example, let $A = \{a, b\}$; the code $\psi_1(a) = 0$, $\psi_1(b) = 00$
is obviously not uniquely decodable. In what follows we call uniquely
decodable codes just "codes". It is well known that if $\varphi$ is a
code then the lengths of the codewords satisfy the following inequality
(Kraft's inequality) [H]: $\sum_{u \in A^n} 2^{-|\varphi_n(u)|} \le 1$.
It will be convenient to reformulate this property as follows:

Claim 4. Let $\varphi$ be a code over an alphabet $A$. Then for any
integer $n$ there exists a measure $\mu_\varphi$ on $A^n$ such that

\[ -\log \mu_\varphi(u) \le |\varphi(u)| \tag{26} \]

for any $u$ from $A^n$.

(Obviously, this claim is true for the measure $\mu_\varphi(u) =
2^{-|\varphi(u)|} / \sum_{u \in A^n} 2^{-|\varphi(u)|}$.)
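
Claim 4 can be illustrated with a toy code. The codeword lengths below
are hypothetical (they correspond to the prefix code 0, 10, 110, 1110
on the four words of $\{a, b\}^2$); the sketch checks Kraft's
inequality and the inequality (26) for the normalized measure.

```python
import math

def kraft_sum(lengths):
    """Sum of 2^(-|codeword|) over all words."""
    return sum(2.0 ** -l for l in lengths.values())

def measure_from_code(lengths):
    """mu(u) = 2^(-|code(u)|) / sum over v of 2^(-|code(v)|), as in Claim 4."""
    z = kraft_sum(lengths)
    return {u: 2.0 ** -l / z for u, l in lengths.items()}

# Hypothetical codeword lengths for the words aa, ab, ba, bb.
lengths = {"aa": 1, "ab": 2, "ba": 3, "bb": 4}
mu = measure_from_code(lengths)
print(kraft_sum(lengths))
for u, l in lengths.items():
    print(u, l, -math.log2(mu[u]))
```

Because the Kraft sum is at most 1, normalizing can only increase each
probability, which is exactly why $-\log \mu_\varphi(u)$ never exceeds
the codeword length.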
It was mentioned above that, in a certain sense, the opposite claim is
true, too. Namely, for any probability measure $\mu$ defined on $A^n$,
$n \ge 1$, there exists a code $\varphi_\mu$ such that

\[ |\varphi_\mu(u)| = -\log \mu(u). \tag{27} \]

(More precisely, for any $\varepsilon > 0$ one can construct a code
$\varphi^*_\mu$ such that $|\varphi^*_\mu(u)| < -\log \mu(u) +
\varepsilon$ for any $u \in A^n$. Such a code can be constructed by
applying so-called arithmetic coding [31].) For example, for the above
described measure $R$ we can construct a code $R_{code}$ such that

\[ |R_{code}(u)| = -\log R(u). \tag{28} \]

As we mentioned above, there exist universal codes. For their
description we recall that sequences $x_1 \ldots x_t$ generated by a
source $P$ can be "compressed" to the length $-\log P(x_1 \ldots x_t)$
bits (see (27)) and, on the other hand, for any source $P$ there is no
code $\psi$ for which the average codeword length ($\sum_{u \in A^t}
P(u) |\psi(u)|$) is less than $-\sum_{u \in A^t} P(u) \log P(u)$.
Universal codes can reach the lower bound $-\log P(x_1 \ldots x_t)$
asymptotically for any stationary and ergodic source $P$, both on
average and with probability 1. The formal definition is as follows: a
code $U$ is universal if for any stationary and ergodic source $P$ the
following equalities are valid:

\[ \lim_{t \to \infty} |U(x_1 \ldots x_t)| / t = h_\infty(P) \tag{29} \]

with probability 1, and

\[ \lim_{t \to \infty} E(|U(x_1 \ldots x_t)|) / t = h_\infty(P), \tag{30} \]

where $E(f)$ is the expected value of $f$ and $h_\infty(P)$ is the
Shannon entropy of $P$, see (21). So, informally speaking, a universal
code estimates the probability characteristics of a source and uses
them for efficient "compression".

In this chapter we mainly consider finite-alphabet and real-valued
sources, but sources with countable alphabet have also been considered
by many authors [H 1161 [T8 l 139 1 Bo]. In particular, it is shown
that, for an infinite alphabet, without any condition on the source
distribution it is impossible to have a universal source code and/or a
universal predictor, i.e. a predictor whose average error goes to zero
when the length of a sequence goes to infinity. On the other hand,
there are some necessary and sufficient conditions for the existence of
universal codes and predictors [H [T8| [39].

3 Finite Alphabet Processes 

3.1 The Estimation of (Limiting) Probabilities 

The following theorem shows how universal codes can be applied for prob- 
ability estimation. 

Theorem 2. Let $U$ be a universal code and

\[ \mu_U(u) = 2^{-|U(u)|} / \sum_{v \in A^{|u|}} 2^{-|U(v)|}. \tag{31} \]

Then, for any stationary and ergodic source $P$ the following
equalities are valid:

\[ i) \quad \lim_{t \to \infty} \frac{1}{t} \big( -\log P(x_1 \ldots x_t) - (-\log \mu_U(x_1 \ldots x_t)) \big) = 0 \]

with probability 1,

\[ ii) \quad \lim_{t \to \infty} \frac{1}{t} \sum_{u \in A^t} P(u) \log \big( P(u) / \mu_U(u) \big) = 0. \]

The informal outline of the proof is as follows: $\frac{1}{t}(-\log
P(x_1 \ldots x_t))$ and $\frac{1}{t}(-\log \mu_U(x_1 \ldots x_t))$ both
go to the Shannon entropy $h_\infty(P)$, which is why the difference
goes to 0.

So, we can see that, in a certain sense, the measure $\mu_U$ is a
consistent nonparametric estimation of the (unknown) measure $P$.

Nowadays there are many efficient universal codes (and universal
predictors connected with them) which can be applied to estimation. For
example, the above described measure $R$ is based on a universal code
[Ml ES] and can be applied for probability estimation. More precisely,
Theorem 2 (and the following theorems) are true for $R$ if we replace
$\mu_U$ by $R$.

It is important to note that the measure $R$ has some additional
properties, which can be useful for applications. The following theorem
describes these properties (whereas all other theorems are valid for
all universal codes and corresponding measures, including the measure
$R$).

Theorem 3 ([34, 35]). For any Markov process $P$ with memory $k$

i) the error of the probability estimator which is based on the measure
$R$ is upper-bounded as follows:



ii) the error of $R$ is asymptotically minimal in the following sense:
for any measure $\mu$ there exists a $k$-memory Markov process $p_\mu$
such that



iii) Let $\Theta$ be a set of stationary and ergodic processes such
that there exists a measure $\mu_\Theta$ for which the estimation error
of the probability goes to 0 uniformly:



Then the error of the estimator which is based on the measure $R$ goes
to 0 uniformly too:


3.2 Prediction 

As we mentioned above, any universal code U can be applied for prediction. Namely, the measure $\mu_U$ (31) can be used for prediction as the following conditional probability:

$$\mu_U(x_{t+1}|x_1 \ldots x_t) = \mu_U(x_1 \ldots x_t x_{t+1}) / \mu_U(x_1 \ldots x_t). \qquad (32)$$

The following theorem shows that such a predictor is quite reasonable. Moreover, it makes it possible to apply practically used data compressors for prediction of real data (like the EUR/USD rate) and to obtain quite precise estimates.

Theorem 4. Let U be a universal code and P be any stationary and ergodic process. Then

i) $\lim_{t\to\infty} \frac{1}{t} E\Bigl\{\log\frac{P(x_1)}{\mu_U(x_1)} + \log\frac{P(x_2|x_1)}{\mu_U(x_2|x_1)} + \ldots + \log\frac{P(x_t|x_1 \ldots x_{t-1})}{\mu_U(x_t|x_1 \ldots x_{t-1})}\Bigr\} = 0,$

ii) $\lim_{t\to\infty} E\Bigl(\frac{1}{t} \sum_{i=0}^{t-1} \bigl(P(x_{i+1}|x_1 \ldots x_i) - \mu_U(x_{i+1}|x_1 \ldots x_i)\bigr)^2\Bigr) = 0,$

and

iii) $\lim_{t\to\infty} E\Bigl(\frac{1}{t} \sum_{i=0}^{t-1} \bigl|P(x_{i+1}|x_1 \ldots x_i) - \mu_U(x_{i+1}|x_1 \ldots x_i)\bigr|\Bigr) = 0.$

An informal outline of the proof is as follows:

$$\frac{1}{t}\Bigl\{E\Bigl(\log\frac{P(x_1)}{\mu_U(x_1)}\Bigr) + E\Bigl(\log\frac{P(x_2|x_1)}{\mu_U(x_2|x_1)}\Bigr) + \ldots + E\Bigl(\log\frac{P(x_t|x_1 \ldots x_{t-1})}{\mu_U(x_t|x_1 \ldots x_{t-1})}\Bigr)\Bigr\}$$

is equal to $\frac{1}{t} E\bigl(\log\frac{P(x_1 \ldots x_t)}{\mu_U(x_1 \ldots x_t)}\bigr)$. Taking into account Theorem 2 we obtain the first statement of the theorem.

Comment 1. The measure R described above has one additional property if it is used for prediction. Namely, for any Markov process P ($P \in M^*(A)$) the following is true:

$$\lim_{t\to\infty} \log \frac{P(x_{t+1}|x_1 \ldots x_t)}{R(x_{t+1}|x_1 \ldots x_t)} = 0$$

with probability 1, where $R(x_{t+1}|x_1 \ldots x_t) = R(x_1 \ldots x_t x_{t+1}) / R(x_1 \ldots x_t)$ [36].

Comment 2. It is known [45] that, in fact, the statements ii) and iii) are equivalent.
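The prediction rule, a conditional probability obtained as a ratio of two values of a measure, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the measure used here is the order-0 Krichevsky measure $K_0$ (the single-sample, m = 0 case of the formula (33) of Section 3.4), chosen as an assumption because it has a closed form; the function names are hypothetical.

```python
from math import exp, lgamma


def log_K0(word: str, alphabet: str) -> float:
    """Natural logarithm of the Krichevsky measure K_0 of `word`."""
    k = len(alphabet)
    t = len(word)
    value = lgamma(k / 2) - lgamma(t + k / 2)
    for a in alphabet:
        value += lgamma(word.count(a) + 0.5) - lgamma(0.5)
    return value


def predict(word: str, alphabet: str, log_measure=log_K0) -> dict:
    """Conditional probabilities measure(word + a) / measure(word) for each a."""
    base = log_measure(word, alphabet)
    return {a: exp(log_measure(word + a, alphabet) - base) for a in alphabet}


prediction = predict("0101101", "01")
```

For this particular measure the ratio reduces to $(\nu(a) + 1/2)/(t + |A|/2)$, so after the word 0101101 (three zeros, four ones) the predicted probabilities are 3.5/8 and 4.5/8.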



3.3 Problems with Side Information 

Now we consider the so-called problems with side information, which are described as follows: there is a stationary and ergodic source whose alphabet A is presented as a product $A = X \times Y$. We are given a sequence $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$ and side information $y_t$. The goal is to predict, or estimate, $x_t$. This problem arises in statistical decision theory, pattern recognition, and machine learning. Obviously, if one knows the conditional probabilities $P(x_t | (x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), y_t)$ for all $x_t \in X$, one has all the information about $x_t$ that is available before $x_t$ is known. That is why we will look for the best (or, at least, good) estimates of these conditional probabilities. Our solution will be based on results obtained in the previous subsection. More precisely, for any universal code U and the corresponding measure $\mu_U$ (31) we define the following estimate for the problem with side information:

$$\mu_U(x_t | (x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), y_t) = \frac{\mu_U((x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), (x_t, y_t))}{\sum_{x_t \in X} \mu_U((x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), (x_t, y_t))}.$$

The following theorem shows that this estimate is quite reasonable.

Theorem 5. Let U be a universal code and let P be any stationary and ergodic process. Then

i) $\lim_{t\to\infty} \frac{1}{t} \Bigl\{E\Bigl(\log\frac{P(x_1|y_1)}{\mu_U(x_1|y_1)}\Bigr) + E\Bigl(\log\frac{P(x_2|(x_1,y_1),y_2)}{\mu_U(x_2|(x_1,y_1),y_2)}\Bigr) + \ldots + E\Bigl(\log\frac{P(x_t|(x_1,y_1), \ldots, (x_{t-1},y_{t-1}),y_t)}{\mu_U(x_t|(x_1,y_1), \ldots, (x_{t-1},y_{t-1}),y_t)}\Bigr)\Bigr\} = 0,$

ii) $\lim_{t\to\infty} E\Bigl(\frac{1}{t} \sum_{i=0}^{t-1} \bigl(P(x_{i+1}|(x_1,y_1), \ldots, (x_i,y_i),y_{i+1}) - \mu_U(x_{i+1}|(x_1,y_1), \ldots, (x_i,y_i),y_{i+1})\bigr)^2\Bigr) = 0,$

and

iii) $\lim_{t\to\infty} E\Bigl(\frac{1}{t} \sum_{i=0}^{t-1} \bigl|P(x_{i+1}|(x_1,y_1), \ldots, (x_i,y_i),y_{i+1}) - \mu_U(x_{i+1}|(x_1,y_1), \ldots, (x_i,y_i),y_{i+1})\bigr|\Bigr) = 0.$
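As an illustration of the estimate with side information, the sketch below normalises a joint measure of the pairs over the candidate values of $x_t$. It rests on assumptions: the joint measure is the order-0 Krichevsky measure over the product alphabet $X \times Y$ (a convenient stand-in for $\mu_U$), and the names `log_K0_pairs` and `side_info_estimate` are hypothetical.

```python
from collections import Counter
from math import exp, lgamma


def log_K0_pairs(pairs, X, Y) -> float:
    """Natural log of the order-0 Krichevsky measure over the alphabet X x Y."""
    k = len(X) * len(Y)
    counts = Counter(pairs)
    value = lgamma(k / 2) - lgamma(len(pairs) + k / 2)
    # Letters that never occur contribute a factor of 1, so only seen pairs matter.
    value += sum(lgamma(c + 0.5) - lgamma(0.5) for c in counts.values())
    return value


def side_info_estimate(history, y_t, X, Y, log_measure=log_K0_pairs) -> dict:
    """Estimate P(x_t | history, y_t) by normalising the joint measure over x_t."""
    logs = {x: log_measure(history + [(x, y_t)], X, Y) for x in X}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {x: exp(v - top) for x, v in logs.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


history = [("a", 0), ("b", 1), ("a", 0), ("b", 1)]
estimate = side_info_estimate(history, 0, X=["a", "b"], Y=[0, 1])
```

In this toy history the letter a always accompanies the side information 0, and the estimate for a given $y_t = 0$ equals 2.5/3.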



The proof is very close to the proof of the previous theorem.

3.4 The Case of Several Independent Samples

In this part we consider a situation which is important for practical applications, but needs cumbersome notation. Namely, we extend our consideration to the case where the sample is presented as several independent samples $x^1 = x^1_1 \ldots x^1_{t_1}$, $x^2 = x^2_1 \ldots x^2_{t_2}$, ..., $x^r = x^r_1 \ldots x^r_{t_r}$ generated by a source. More precisely, we will suppose that all sequences were independently created by one stationary and ergodic source. (The point is that it is impossible just to combine all samples into one, if the source is not i.i.d.) We denote them by $x^1 \diamond x^2 \diamond \ldots \diamond x^r$ and define $\nu_{x^1 \diamond x^2 \diamond \ldots \diamond x^r}(v) = \sum_{i=1}^{r} \nu_{x^i}(v)$. For example, if $x^1 = 0010$, $x^2 = 011$, then $\nu_{x^1 \diamond x^2}(00) = 1$. The definition of $K_m$ and R can be extended to this case:

$$K_m(x^1 \diamond x^2 \diamond \ldots \diamond x^r) = |A|^{-\sum_{i=1}^{r} \min\{m, t_i\}} \prod_{v \in A^m} \frac{\prod_{a \in A} \bigl(\Gamma(\nu_{x^1 \diamond x^2 \diamond \ldots \diamond x^r}(va) + 1/2) / \Gamma(1/2)\bigr)}{\Gamma(\bar\nu_{x^1 \diamond x^2 \diamond \ldots \diamond x^r}(v) + |A|/2) / \Gamma(|A|/2)}, \qquad (33)$$

whereas the definition of R is the same (see (25)). (Here, as before, $\bar\nu_{x^1 \diamond x^2 \diamond \ldots \diamond x^r}(v) = \sum_{a \in A} \nu_{x^1 \diamond x^2 \diamond \ldots \diamond x^r}(va)$. Note that $\bar\nu_{x^1 \diamond x^2 \diamond \ldots \diamond x^r}(\,) = \sum_{i=1}^{r} t_i$ if m = 0.)

The following example is intended to show the difference between the case of many samples and one.

Example. Let there be two independent samples $y = y_1 \ldots y_4 = 0101$ and $x = x_1 \ldots x_3 = 101$, generated by a stationary and ergodic source with the alphabet {0, 1}. One wants to estimate the (limiting) probabilities $P(z_1 z_2)$, $z_1, z_2 \in \{0, 1\}$ (here $z_1 z_2 \ldots$ can be considered as an independent sequence, generated by the source) and predict $x_4 x_5$ (i.e. estimate the conditional probability $P(x_4 x_5 | x_1 \ldots x_3 = 101, y_1 \ldots y_4 = 0101)$). For solving both problems we will use the measure R (see (25)). First we consider the case where $P(z_1 z_2)$ is to be estimated without knowledge of the sequences x and y. Those probabilities were calculated previously and we obtained: $R(00) \approx 0.296$, $R(01) = R(10) \approx 0.204$, $R(11) \approx 0.296$. Let us now estimate the probability $P(z_1 z_2)$ taking into account that there are two independent samples $y = y_1 \ldots y_4 = 0101$ and $x = x_1 \ldots x_3 = 101$. First of all we note that such estimates are based on the formula for conditional probabilities:

$$R(z | x \diamond y) = R(x \diamond y \diamond z) / R(x \diamond y).$$

Then we estimate the frequencies: $\nu_{0101 \diamond 101}(0) = 3$, $\nu_{0101 \diamond 101}(1) = 4$, $\nu_{0101 \diamond 101}(00) = \nu_{0101 \diamond 101}(11) = 0$, $\nu_{0101 \diamond 101}(01) = 3$, $\nu_{0101 \diamond 101}(10) = 2$, $\nu_{0101 \diamond 101}(010) = 1$, $\nu_{0101 \diamond 101}(101) = 2$, $\nu_{0101 \diamond 101}(0101) = 1$, whereas the frequencies of all other three-letter and four-letter words are 0. Then we calculate:

$K_0(0101 \diamond 101) \approx 0.00244$, $K_1(0101 \diamond 101) = (2^{-1})^2 \, \frac{1 \cdot 3 \cdot 5}{2 \cdot 4 \cdot 6} \, \frac{1 \cdot 3}{2 \cdot 4} \approx 0.0293$, $K_2(0101 \diamond 101) \approx 0.01172$, $K_i(0101 \diamond 101) = 2^{-7}$, $i \ge 3$,

$R(0101 \diamond 101) = \omega_1 K_0(0101 \diamond 101) + \omega_2 K_1(0101 \diamond 101) + \ldots \approx 0.369 \cdot 0.00244 + 0.131 \cdot 0.0293 + 0.06932 \cdot 0.01172 + 2^{-7} / \log 5 \approx 0.0089.$

In order to avoid repetitions, we estimate only one probability $P(z_1 z_2 = 01)$. Carrying out similar calculations, we obtain $R(0101 \diamond 101 \diamond 01) \approx 0.00292$, $R(z_1 z_2 = 01 | y_1 \ldots y_4 = 0101, x_1 \ldots x_3 = 101) = R(0101 \diamond 101 \diamond 01) / R(0101 \diamond 101) \approx 0.32812$. If we compare this value and the estimate $R(01) \approx 0.204$, which is not based on the knowledge of the samples x and y, we can see that the measure R uses additional information quite naturally (indeed, 01 is quite frequent in $y = y_1 \ldots y_4 = 0101$ and $x = x_1 \ldots x_3 = 101$).
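The numbers of this example can be reproduced with a short sketch. Two things are assumed rather than taken from the text above: the weights are $\omega_i = 1/\log_2(i+1) - 1/\log_2(i+2)$ (this choice matches the quoted values 0.369, 0.131 and 0.06932), and the infinite mixture is summed in closed form by using the fact that $K_m = |A|^{-t}$ once m is at least the length of the longest sample. The function names are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log, log2


def ngram_counts(samples, n):
    """nu(v) for all words v of length n, summed over the independent samples."""
    counts = Counter()
    for s in samples:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts


def K(samples, m, alphabet="01"):
    """K_m of several samples, following the structure of (33)."""
    k = len(alphabet)
    nu = ngram_counts(samples, m + 1)              # nu(va) with |v| = m
    nu_bar = Counter()
    for word, c in nu.items():
        nu_bar[word[:m]] += c                      # nu_bar(v) = sum over a of nu(va)
    log_value = -sum(min(m, len(s)) for s in samples) * log(k)
    log_value += sum(lgamma(c + 0.5) - lgamma(0.5) for c in nu.values())
    log_value -= sum(lgamma(c + k / 2) - lgamma(k / 2) for c in nu_bar.values())
    return exp(log_value)


def omega(i):
    """Assumed mixture weights; they sum to 1 over i = 1, 2, ..."""
    return 1 / log2(i + 1) - 1 / log2(i + 2)


def R(samples, alphabet="01"):
    """Mixture omega_1 K_0 + omega_2 K_1 + ... with the tail summed exactly."""
    longest = max(len(s) for s in samples)
    t = sum(len(s) for s in samples)
    head = sum(omega(i) * K(samples, i - 1, alphabet) for i in range(1, longest + 1))
    tail = (1 / log2(longest + 2)) * len(alphabet) ** (-t)  # K_m = |A|^-t for m >= longest
    return head + tail


r_two = R(["0101", "101"])
r_three = R(["0101", "101", "01"])
```

With these assumptions `r_two` is about 0.0089, `r_three` is about 0.00292, and their ratio is about 0.328, in agreement with the example.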

Such a generalization can be applied to many universal codes, but, generally speaking, there exist codes U for which $U(x^1 \diamond x^2)$ is not defined and, hence, the measure $\mu_U(x^1 \diamond x^2)$ is not defined. That is why we will describe properties of the universal code R, but not of universal codes in general. For the measure R all asymptotic properties are the same for the cases of one sample and several samples. More precisely, the following statement is true:

Claim 5. Let $x^1, x^2, \ldots, x^r$ be independent sequences generated by a stationary and ergodic source and let t be the total length of these sequences ($t = \sum_{i=1}^{r} |x^i|$). Then, if $t \to \infty$ (and r is fixed), the statements of Theorems 2-5 are valid when applied to $x^1 \diamond x^2 \diamond \ldots \diamond x^r$ instead of $x_1 \ldots x_t$. (In the theorems $\mu_U$ should be changed to R.)

The proofs are completely analogous to the proofs of Theorems 2-5.
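Now we can extend the definition of the empirical Shannon entropy (23) to the case of several words $x^1 = x^1_1 \ldots x^1_{t_1}$, $x^2 = x^2_1 \ldots x^2_{t_2}$, ..., $x^r = x^r_1 \ldots x^r_{t_r}$. We define $\nu_{x^1 \diamond x^2 \diamond \ldots \diamond x^r}(v) = \sum_{i=1}^{r} \nu_{x^i}(v)$. For example, if $x^1 = 0010$, $x^2 = 011$, then $\nu_{x^1 \diamond x^2}(00) = 1$. Analogously to (23),

$$h^*_k(x^1 \diamond x^2 \diamond \ldots \diamond x^r) = - \sum_{v \in A^k} \frac{\bar\nu_{x^1 \diamond \ldots \diamond x^r}(v)}{(t - kr)} \sum_{a \in A} \frac{\nu_{x^1 \diamond \ldots \diamond x^r}(va)}{\bar\nu_{x^1 \diamond \ldots \diamond x^r}(v)} \log \frac{\nu_{x^1 \diamond \ldots \diamond x^r}(va)}{\bar\nu_{x^1 \diamond \ldots \diamond x^r}(v)}, \qquad (34)$$

where $\bar\nu_{x^1 \diamond \ldots \diamond x^r}(v) = \sum_{a \in A} \nu_{x^1 \diamond \ldots \diamond x^r}(va)$.

For any sequence of words $x^1 = x^1_1 \ldots x^1_{t_1}$, $x^2 = x^2_1 \ldots x^2_{t_2}$, ..., $x^r = x^r_1 \ldots x^r_{t_r}$ from $A^*$ and any measure $\theta$ we define $\theta(x^1 \diamond x^2 \diamond \ldots \diamond x^r) = \prod_{i=1}^{r} \theta(x^i)$. The following lemma gives an upper bound for unknown probabilities.

Lemma 1. Let $\theta$ be a measure from $M_m(A)$, $m \ge 0$, and $x^1, \ldots, x^r$ be words from $A^*$, whose lengths are not less than m. Then

$$\theta(x^1 \diamond \ldots \diamond x^r) \le 2^{-(t - rm)\, h^*_m(x^1 \diamond \ldots \diamond x^r)}, \qquad (35)$$

where $\theta(x^1 \diamond \ldots \diamond x^r) = \prod_{i=1}^{r} \theta(x^i)$.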
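The empirical entropy of several samples and the bound of Lemma 1 can be checked numerically with a sketch like the one below. It assumes the conditional-entropy form of the empirical entropy, with the normalising factor equal to the total number of counted (m+1)-letter windows (which is t - rm when every sample has length at least m); the function name is hypothetical, and the check of the lemma is done only for i.i.d. measures (m = 0).

```python
from collections import Counter
from math import log2


def empirical_entropy(samples, m):
    """Empirical Shannon entropy of order m (in bits) for several samples."""
    nu = Counter()
    for s in samples:
        for i in range(len(s) - m):
            nu[s[i:i + m + 1]] += 1               # nu(va), |v| = m
    nu_bar = Counter()
    for word, c in nu.items():
        nu_bar[word[:m]] += c                     # nu_bar(v)
    total = sum(nu_bar.values())                  # equals t - r*m for long enough samples
    return -sum(c / total * log2(c / nu_bar[w[:m]]) for w, c in nu.items())


samples = ["0101", "101"]
h0 = empirical_entropy(samples, 0)
bound = 2 ** (-7 * h0)                            # right-hand side of (35) for m = 0, t = 7
```

For these samples the bound equals $(3/7)^3 (4/7)^4$, and every i.i.d. measure with P(0) = p assigns the pair of samples the probability $p^3 (1-p)^4$, which does not exceed it.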

4 Hypothesis Testing 

4.1 Goodness-of-Fit or Identity Testing 

Now we consider the problem of testing $H_0^{id}$ against $H_1^{id}$. Let us recall that the hypothesis $H_0^{id}$ is that the source has a particular distribution $\pi$ and the alternative hypothesis $H_1^{id}$ is that the sequence is generated by a stationary and ergodic source which differs from the source under $H_0^{id}$. Let the required level of significance (or the Type I error) be $\alpha$, $\alpha \in (0, 1)$. We describe a statistical test which can be constructed based on any code $\varphi$.

The main idea of the suggested test is quite natural: compress a sample sequence $x_1 \ldots x_t$ by a code $\varphi$. If the length of the codeword ($|\varphi(x_1 \ldots x_t)|$) is significantly less than the value $-\log \pi(x_1 \ldots x_t)$, then $H_0^{id}$ should be rejected. The key observation is that the probability of all rejected sequences is quite small for any $\varphi$, which is why the Type I error can be made small. The precise description of the test is as follows: the hypothesis $H_0^{id}$ is accepted if

$$-\log \pi(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)| \le -\log \alpha. \qquad (36)$$

Otherwise, $H_0^{id}$ is rejected. We denote this test by $T_\varphi^{id}(A, \alpha)$.
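The acceptance rule (36) is easy to try out with an off-the-shelf compressor. The sketch below is an illustration under assumptions: zlib plays the role of the code $\varphi$, the null hypothesis is that the bytes are i.i.d. and uniform over 256 values (so that $-\log_2 \pi(x_1 \ldots x_t) = 8t$), and the function name `identity_test` is hypothetical.

```python
import math
import random
import zlib


def identity_test(sample: bytes, neg_log2_pi: float, alpha: float = 0.01,
                  compress=zlib.compress) -> bool:
    """Return True if H0 is accepted according to inequality (36)."""
    codeword_bits = 8 * len(compress(sample))
    return neg_log2_pi - codeword_bits <= -math.log2(alpha)


rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(1000))   # looks uniform
periodic = b"ab" * 500                                   # clearly not uniform

accept_noise = identity_test(noise, 8 * len(noise))
accept_periodic = identity_test(periodic, 8 * len(periodic))
```

The pseudo-random bytes are not compressed below 8 bits per symbol, so the hypothesis is accepted for them; the periodic sequence is compressed to a few dozen bytes and the hypothesis is rejected.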

Theorem 6. i) For each distribution $\pi$, $\alpha \in (0, 1)$ and a code $\varphi$, the Type I error of the described test $T_\varphi^{id}(A, \alpha)$ is not larger than $\alpha$ and ii) if, in addition, $\pi$ is a finite-order stationary and ergodic process over $A^\infty$ (i.e. $\pi \in M^*(A)$) and $\varphi$ is a universal code, then the Type II error of the test $T_\varphi^{id}(A, \alpha)$ goes to 0, when t tends to infinity.

4.2 Testing for Serial Independence

Let us recall that the null hypothesis $H_0^{SI}$ is that the source is Markovian of order not larger than m ($m \ge 0$), and the alternative hypothesis $H_1^{SI}$ is that the sequence is generated by a stationary and ergodic source which differs from the source under $H_0^{SI}$. In particular, if m = 0, this is the problem of testing for independence of time series.

Let there be given a sample $x_1 \ldots x_t$ generated by an (unknown) source $\pi$. The described test is as follows.

Let $\varphi$ be any code. By definition, the hypothesis $H_0^{SI}$ is accepted if

$$(t - m)\, h^*_m(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)| \le \log(1/\alpha), \qquad (37)$$

where $\alpha \in (0, 1)$. Otherwise, $H_0^{SI}$ is rejected. We denote this test by $T_\varphi^{SI}(A, \alpha)$.

Theorem 7. i) For any code $\varphi$ the Type I error of the test $T_\varphi^{SI}(A, \alpha)$ is less than or equal to $\alpha$, $\alpha \in (0, 1)$, and ii) if, in addition, $\varphi$ is a universal code, then the Type II error of the test $T_\varphi^{SI}(A, \alpha)$ goes to 0, when t tends to infinity.
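The rule (37) compares the empirical entropy of order m with the length of the compressed sample. A minimal sketch, assuming zlib as the code $\varphi$ and using hypothetical function names, could look as follows.

```python
import random
import zlib
from collections import Counter
from math import log2


def empirical_entropy_bits(sample: bytes, m: int) -> float:
    """Empirical Shannon entropy of order m, in bits per letter."""
    nu = Counter(sample[i:i + m + 1] for i in range(len(sample) - m))
    nu_bar = Counter()
    for word, c in nu.items():
        nu_bar[word[:m]] += c
    total = sum(nu_bar.values())                  # t - m counted windows
    return -sum(c / total * log2(c / nu_bar[w[:m]]) for w, c in nu.items())


def independence_test(sample: bytes, m: int = 0, alpha: float = 0.01,
                      compress=zlib.compress) -> bool:
    """Return True if H0 (Markov order <= m) is accepted according to (37)."""
    t = len(sample)
    statistic = (t - m) * empirical_entropy_bits(sample, m) - 8 * len(compress(sample))
    return statistic <= log2(1 / alpha)


rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(1000))
periodic = b"ab" * 500
```

For the periodic sample the order-0 hypothesis (independence) is rejected, because the empirical entropy is 1 bit per letter while the compressed length is far smaller than 1000 bits; the order-1 hypothesis is accepted, since the empirical entropy of order 1 is 0.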

5 Real-Valued Time Series

5.1 Density Estimation and Its Application 

Here we address the problem of nonparametric estimation of the density for time series. Let $X_t$ be a time series whose probability distribution is unknown, but it is known that the time series is stationary and ergodic. We have seen that the Shannon-McMillan-Breiman theorem played a key role in the case of finite-alphabet processes. In this part we will use its generalization to processes with densities, which was established by Barron [3]. First we describe the considered processes with some properties needed for the generalized Shannon-McMillan-Breiman theorem to hold. In what follows, we restrict our attention to processes that take bounded real values. However, the main results may be extended to processes taking values in a compact subset of a separable metric space.

Let B denote the Borel subsets of R, and $B^k$ denote the Borel subsets of $R^k$, where R is the set of real numbers. Let $R^\infty$ be the set of all infinite sequences $x = x_1, x_2, \ldots$ with $x_i \in R$, and let $B^\infty$ denote the usual product sigma field on $R^\infty$, generated by the finite dimensional cylinder sets $\{A_1, \ldots, A_k, R, R, \ldots\}$, where $A_i \in B$, $i = 1, \ldots, k$. Each stochastic process $X_1, X_2, \ldots$, $X_i \in R$, is defined by a probability distribution on $(R^\infty, B^\infty)$. Suppose that the joint distribution $P_n$ for $(X_1, X_2, \ldots, X_n)$ has a probability density function $p(x_1 x_2 \ldots x_n)$ with respect to a sigma-finite measure $M_n$. Assume that the sequence of dominating measures $M_n$ is Markov of order $m \ge 0$ with a stationary transition measure. A familiar case for $M_n$ is Lebesgue measure. Let $p(x_{n+1}|x_1 \ldots x_n)$ denote the conditional density given by the ratio $p(x_1 \ldots x_{n+1}) / p(x_1 \ldots x_n)$ for n > 1. It is known that for stationary and ergodic processes there exists a so-called relative entropy rate h defined by

$$h = \lim_{n\to\infty} -E(\log p(x_{n+1}|x_1 \ldots x_n)), \qquad (38)$$

where E denotes expectation with respect to P. We will use the following generalization of the Shannon-McMillan-Breiman theorem:

Claim 6 ([3]). If $\{X_n\}$ is a P-stationary ergodic process with density $p(x_1 \ldots x_n) = dP_n/dM_n$ and $h_n < \infty$ for some $n \ge m$, the sequence of relative entropy densities $-(1/n) \log p(x_1 \ldots x_n)$ converges almost surely to the relative entropy rate, i.e.,

$$\lim_{n\to\infty} (-1/n) \log p(x_1 \ldots x_n) = h \qquad (39)$$

with probability 1 (according to P).

Now we return to the estimation problems. Let $\{\Pi_n\}$, $n \ge 1$, be an increasing sequence of finite partitions of R that asymptotically generates the Borel sigma-field B and let $x^{[k]}$ denote the element of $\Pi_k$ that contains the point x. (Informally, $x^{[k]}$ is obtained by quantizing x to k bits of precision.) For integers s and n we define the following approximation of the density

$$p^s(x_1 \ldots x_n) = P(x_1^{[s]} \ldots x_n^{[s]}) / M_n(x_1^{[s]} \ldots x_n^{[s]}). \qquad (40)$$

We also consider

$$h_s = \lim_{n\to\infty} -E(\log p^s(x_{n+1}|x_1 \ldots x_n)). \qquad (41)$$

Applying Claim 6 to the density $p^s(x_1 \ldots x_t)$, we obtain that a.s.

$$\lim_{t\to\infty} -\frac{1}{t} \log p^s(x_1 \ldots x_t) = h_s. \qquad (42)$$

Let U be a universal code, which is defined for any finite alphabet. In order to describe a density estimate we will use the probability distribution $\omega_i$, $i = 1, 2, \ldots$, see (24). (In what follows we will use this distribution, but the results described below are obviously true for any distribution with nonzero probabilities.) Now we can define the density estimate $r_U$ as follows:

$$r_U(x_1 \ldots x_t) = \sum_{i=0}^{\infty} \omega_i\, \mu_U(x_1^{[i]} \ldots x_t^{[i]}) / M_t(x_1^{[i]} \ldots x_t^{[i]}), \qquad (43)$$

where the measure $\mu_U$ is defined by (31). (It is assumed here that the code $U(x_1^{[i]} \ldots x_t^{[i]})$ is defined for the alphabet, which contains $|\Pi_i|$ letters.)

It turns out that, in a certain sense, the density $r_U(x_1 \ldots x_t)$ estimates the unknown density $p(x_1 \ldots x_t)$.
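A density estimate of the form (43) can be sketched for data in [0, 1). Everything specific in the sketch is an assumption made for illustration: the partitions are dyadic (level i has $2^i$ equal cells), the dominating measure is Lebesgue measure, the order-0 Krichevsky measure replaces $\mu_U$, the weights are $\omega_i = 1/\log_2(i+1) - 1/\log_2(i+2)$, and the sum is truncated to levels 1, ..., `s_max`, so the result is a sub-density whose total mass is the sum of the retained weights.

```python
from math import exp, lgamma, log, log2


def log_K0(cells, k):
    """Natural log of the order-0 Krichevsky measure over an alphabet of size k."""
    counts = {}
    for c in cells:
        counts[c] = counts.get(c, 0) + 1
    value = lgamma(k / 2) - lgamma(len(cells) + k / 2)
    value += sum(lgamma(n + 0.5) - lgamma(0.5) for n in counts.values())
    return value


def r_U(xs, s_max=8):
    """Truncated mixture of quantised estimates for points xs in [0, 1)."""
    t = len(xs)
    total = 0.0
    for i in range(1, s_max + 1):
        k = 2 ** i                                   # number of cells at level i
        cells = [min(int(x * k), k - 1) for x in xs]  # quantise each point
        omega = 1 / log2(i + 1) - 1 / log2(i + 2)
        # Lebesgue measure of the product of t cells is k^-t, hence "+ t*log(k)".
        total += omega * exp(log_K0(cells, k) + t * log(k))
    return total


history = [0.1, 0.2, 0.3, 0.4]
near = r_U(history + [0.25]) / r_U(history)   # conditional estimate at 0.25
far = r_U(history + [0.75]) / r_U(history)    # conditional estimate at 0.75
```

After four observations in the left half of the interval, the conditional estimate at 0.25 exceeds the one at 0.75.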

Theorem 8. Let $X_t$ be a stationary ergodic process with densities $p(x_1 \ldots x_t) = dP_t/dM_t$ such that

$$\lim_{s\to\infty} h_s = h < \infty, \qquad (44)$$

where h and $h_s$ are relative entropy rates, see (38), (41). Then

$$\lim_{t\to\infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} = 0 \qquad (45)$$

with probability 1 and

$$\lim_{t\to\infty} \frac{1}{t} E\Bigl(\log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)}\Bigr) = 0. \qquad (46)$$



We have seen that the requirement (44) plays an important role in the proof. The natural question is whether there exist processes for which (44) is valid. The answer is positive. For example, let a process take values in the interval [-1, 1], let $M_n$ be Lebesgue measure and let the considered process be Markovian with conditional density

$$p(x|y) = \begin{cases} 1/2 + \alpha\, \mathrm{sign}(y), & \text{if } x < 0, \\ 1/2 - \alpha\, \mathrm{sign}(y), & \text{if } x \ge 0, \end{cases}$$

where $\alpha \in (0, 1/2)$ is a parameter and

$$\mathrm{sign}(y) = \begin{cases} -1, & \text{if } y < 0, \\ 1, & \text{if } y \ge 0. \end{cases}$$

In words, the density depends on the sign of the previous value: on each half of the interval it equals either $1/2 + \alpha$ or $1/2 - \alpha$, and which of the two applies is determined by the sign of y. It is easy to see that (44) is true for any $\alpha \in (0, 1/2)$.

The following two theorems are devoted to the conditional probability $r_U(x|x_1 \ldots x_m) = r_U(x_1 \ldots x_m x) / r_U(x_1 \ldots x_m)$ which, in turn, is connected with the prediction problem. We will see that the conditional density $r_U(x|x_1 \ldots x_m)$ is a reasonable estimate of the unknown density $p(x|x_1 \ldots x_m)$.

Theorem 9. Let $B_1, B_2, \ldots$ be a sequence of measurable sets. Then the following equalities are true:

i) $$\lim_{t\to\infty} E\Bigl(\frac{1}{t} \sum_{m=0}^{t-1} \bigl(P(x_{m+1} \in B_{m+1}|x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1}|x_1 \ldots x_m)\bigr)^2\Bigr) = 0, \qquad (47)$$

ii) $$\lim_{t\to\infty} E\Bigl(\frac{1}{t} \sum_{m=0}^{t-1} \bigl|P(x_{m+1} \in B_{m+1}|x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1}|x_1 \ldots x_m)\bigr|\Bigr) = 0,$$

where $R_U(x_{m+1} \in B_{m+1}|x_1 \ldots x_m) = \int_{B_{m+1}} r_U(x|x_1 \ldots x_m)\, dM_{1/m}$.

We have seen that in a certain sense the estimate $r_U$ approximates the unknown density p. The following theorem shows that $r_U$ can be used instead of p for estimation of average values of certain functions.

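Theorem 10. Let f be an integrable function, whose absolute value is bounded by a certain constant M, and let all conditions of Theorem 8 be true. Then the following equalities are valid:

i) $$\lim_{t\to\infty} \frac{1}{t} E\Bigl(\sum_{m=0}^{t-1} \Bigl(\int f(x)\, p(x|x_1 \ldots x_m)\, dM_m - \int f(x)\, r_U(x|x_1 \ldots x_m)\, dM_m\Bigr)^2\Bigr) = 0, \qquad (48)$$

ii) $$\lim_{t\to\infty} \frac{1}{t} E\Bigl(\sum_{m=0}^{t-1} \Bigl|\int f(x)\, p(x|x_1 \ldots x_m)\, dM_m - \int f(x)\, r_U(x|x_1 \ldots x_m)\, dM_m\Bigr|\Bigr) = 0.$$
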

It is worth noting that this approach was used for prediction of real 
processes [H]. 



5.2 Hypothesis Testing 

In this subsection we consider the case where the source alphabet A is infinite, say, a part of $R^n$. Our strategy is to use finite partitions of A and to consider hypotheses corresponding to the partitions. This approach can be directly applied to the goodness-of-fit testing, but it cannot be applied to the serial independence testing. The point is that if someone combines letters (or states) of a Markov chain, the chain order (or memory) can increase. For example, if the alphabet contains three letters, there exists a Markov chain of order one, such that combining two letters into one transforms the chain into a process with infinite memory. That is why in this part we will consider the independence testing for i.i.d. processes only (i.e. processes from $M_0(A)$).

In order to avoid repetitions, we will consider a general scheme, which can be applied to both tests using the notations $H_0^{\aleph}$, $H_1^{\aleph}$ and $T_\varphi^{\aleph}(A, \alpha)$, where $\aleph$ is an abbreviation of one of the described tests (i.e. id and SI).

Let us give some definitions. Let $\Lambda = \lambda_1, \ldots, \lambda_s$ be a finite (measurable) partition of A and let $\Lambda(x)$ be the element of the partition $\Lambda$ which contains $x \in A$. For any process $\pi$ we define a process $\pi_\Lambda$ over the new alphabet $\Lambda$ by the equation

$$\pi_\Lambda(\lambda_{i_1} \ldots \lambda_{i_k}) = \pi\{x_1 \in \lambda_{i_1}, \ldots, x_k \in \lambda_{i_k}\},$$

where $x_1 \ldots x_k \in A^k$.

We will consider an infinite sequence of partitions $\hat\Lambda = \Lambda_1, \Lambda_2, \ldots$ and say that such a sequence discriminates between a pair of hypotheses $H_0^{\aleph}(A)$, $H_1^{\aleph}(A)$ about processes, if for each process $\varrho$, for which $H_1^{\aleph}(A)$ is true, there exists a partition $\Lambda_j$ for which $H_1^{\aleph}(\Lambda_j)$ is true for the process $\varrho_{\Lambda_j}$.

Let $H_0^{\aleph}(A)$, $H_1^{\aleph}(A)$ be a pair of hypotheses, $\hat\Lambda = \Lambda_1, \Lambda_2, \ldots$ be a sequence of partitions, $\alpha$ be from (0, 1) and $\varphi$ be a code. The scheme for both tests is as follows:

The hypothesis $H_0^{\aleph}(A)$ is accepted if for all i = 1, 2, 3, ... the test $T_\varphi^{\aleph}(\Lambda_i, (\alpha\omega_i))$ accepts the hypothesis $H_0^{\aleph}(\Lambda_i)$. Otherwise, $H_0^{\aleph}$ is rejected. We denote this test $T_{\alpha,\varphi}^{\aleph}(\hat\Lambda)$.

Comment 3. It is important to note that one does not need to check an infinite number of inequalities when applying this test. The point is that the hypothesis $H_0^{\aleph}(\Lambda_i)$ has to be accepted if the left part in (36) or (37) is less than $-\log(\alpha\omega_i)$. Obviously, $-\log(\alpha\omega_i)$ goes to infinity if i increases. That is why there are many cases where it is enough to check a finite number of hypotheses $H_0^{\aleph}(\Lambda_i)$.
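The scheme with a sequence of partitions can be sketched for the goodness-of-fit case. The sketch relies on several assumptions that are not part of the text: the sample lies in [0, 1), the null hypothesis is the i.i.d. uniform distribution, level i of the partition sequence has $2^i$ equal cells, zlib is the code $\varphi$, the weights are $\omega_i = 1/\log_2(i+1) - 1/\log_2(i+2)$, and only the first few levels are checked (a truncation in the spirit of Comment 3, not a proof that the remaining levels would accept).

```python
import math
import random
import zlib


def omega(i: int) -> float:
    """Assumed weights of the partition levels."""
    return 1 / math.log2(i + 1) - 1 / math.log2(i + 2)


def partition_identity_test(xs, alpha=0.01, levels=6, compress=zlib.compress) -> bool:
    """Accept H0 (i.i.d. uniform on [0, 1)) if every checked level accepts it."""
    t = len(xs)
    for i in range(1, levels + 1):
        k = 2 ** i
        cells = bytes(min(int(x * k), k - 1) for x in xs)  # quantised sample
        neg_log2_pi = t * i                 # each cell has probability 2^-i under H0
        statistic = neg_log2_pi - 8 * len(compress(cells))
        if statistic > -math.log2(alpha * omega(i)):
            return False                    # level i rejects, so the whole test rejects
    return True


rng = random.Random(1)
uniform_sample = [rng.random() for _ in range(500)]
constant_sample = [0.3] * 500
```

The constant sample is rejected already at the first level, while the pseudo-random sample passes all checked levels.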

Theorem 11. i) For each $\alpha \in (0, 1)$, sequence of partitions $\hat\Lambda$ and a code $\varphi$, the Type I error of the described test $T_{\alpha,\varphi}^{\aleph}(\hat\Lambda)$ is not larger than $\alpha$, and ii) if, in addition, $\varphi$ is a universal code and $\hat\Lambda$ discriminates between $H_0^{\aleph}(A)$, $H_1^{\aleph}(A)$, then the Type II error of the test $T_{\alpha,\varphi}^{\aleph}(\hat\Lambda)$ goes to 0, when the sample size tends to infinity.

6 Conclusion

Time series is a popular model of real stochastic processes which has a lot of applications in industry, economy, meteorology and many other fields. Despite this, there are many practically important problems of statistical analysis of time series which are still open. Among them we can name the problem of estimation of the limiting probabilities and densities, on-line prediction, regression, classification and some problems of hypothesis testing (goodness-of-fit testing and testing of serial independence). This chapter describes a new approach to all the problems mentioned above, which, on the one hand, makes it possible to solve the problems in the framework of classical mathematical statistics and, on the other hand, allows one to apply methods of real data compression to solve these problems in practice. Such applications to randomness testing [42] and prediction of currency exchange rates showed high efficiency, which is why the suggested methods look very promising for practical applications. Of course, problems like prediction of the price of oil, gold, etc. and testing of different random number generators can be used as case studies for students.


7 Appendix 

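Proof of Claim 1. We employ the general inequality

$$D(\mu \| \eta) \le \log e \Bigl(-1 + \sum_{a \in A} \mu(a)^2 / \eta(a)\Bigr),$$

valid for any distributions $\mu$ and $\eta$ over A (it follows from the elementary inequality for the natural logarithm $\ln x \le x - 1$), and find:

$$\rho^t(P \| L_0) = \sum_{x_1 \cdots x_t \in A^t} P(x_1 \cdots x_t) \sum_{a \in A} P(a|x_1 \cdots x_t) \log \frac{P(a|x_1 \cdots x_t)}{L_0(a|x_1 \cdots x_t)}$$

$$= \log e \Bigl(\sum_{x_1 \cdots x_t \in A^t} P(x_1 \cdots x_t) \sum_{a \in A} P(a|x_1 \cdots x_t) \ln \frac{P(a|x_1 \cdots x_t)}{L_0(a|x_1 \cdots x_t)}\Bigr)$$

$$\le \log e \Bigl(-1 + \sum_{x_1 \cdots x_t \in A^t} P(x_1 \cdots x_t) \sum_{a \in A} \frac{P(a)^2\, (t + |A|)}{\nu^t(a) + 1}\Bigr).$$

Applying the well-known Bernoulli formula, we obtain

$$\rho^t(P \| L_0) \le \log e \Bigl(-1 + \sum_{a \in A} \sum_{i=0}^{t} \frac{P(a)^2\, (t + |A|)}{i + 1} \binom{t}{i} P(a)^i (1 - P(a))^{t-i}\Bigr)$$

$$= \log e \Bigl(-1 + \frac{t + |A|}{t + 1} \sum_{a \in A} P(a) \sum_{i=0}^{t} \binom{t+1}{i+1} P(a)^{i+1} (1 - P(a))^{t-i}\Bigr)$$

$$\le \log e \Bigl(-1 + \frac{t + |A|}{t + 1} \sum_{a \in A} P(a) \sum_{j=0}^{t+1} \binom{t+1}{j} P(a)^{j} (1 - P(a))^{t+1-j}\Bigr).$$

Again, using the Bernoulli formula, we finish the proof:

$$\rho^t(P \| L_0) \le \log e\, \frac{|A| - 1}{t + 1}.$$

The second statement of the claim follows from the well-known asymptotic equality

$$1 + 1/2 + 1/3 + \ldots + 1/t = \ln t + O(1),$$

the obvious presentation

$$\bar\rho^t(P \| L_0) = t^{-1}\bigl(\rho^0(P \| L_0) + \rho^1(P \| L_0) + \ldots + \rho^{t-1}(P \| L_0)\bigr)$$

and the inequality just proven. □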
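The bound $\log e\, (|A| - 1)/(t + 1)$ can be checked numerically for a binary i.i.d. source. The sketch assumes that $L_0$ is the Laplace predictor, $L_0(a|x_1 \ldots x_t) = (\nu^t(a) + 1)/(t + |A|)$, which is consistent with the factor $(t + |A|)/(\nu^t(a) + 1)$ appearing in the proof; the function names are hypothetical.

```python
from math import comb, e, log2


def laplace_error(p: float, t: int) -> float:
    """Exact expected KL error (bits) of the Laplace predictor after t letters,
    for a binary i.i.d. source with P(1) = p."""
    q = 1.0 - p
    total = 0.0
    for i in range(t + 1):                        # i = number of ones in the sample
        weight = comb(t, i) * p ** i * q ** (t - i)
        pred_one = (i + 1) / (t + 2)              # Laplace estimate of P(1 | sample)
        pred_zero = (t - i + 1) / (t + 2)
        total += weight * (p * log2(p / pred_one) + q * log2(q / pred_zero))
    return total


def bound(t: int, alphabet_size: int = 2) -> float:
    """Right-hand side of the first statement of the claim, in bits."""
    return (alphabet_size - 1) * log2(e) / (t + 1)


errors = {(p, t): laplace_error(p, t) for p in (0.1, 0.3, 0.5, 0.9) for t in (1, 5, 20, 100)}
```

Every computed error is non-negative and stays below the corresponding bound, as the proof predicts.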



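Proof of Claim 2. The first equality follows from the definition (9), whereas the second from the definition (12). From (16) we obtain:

$$-\log K_0(x_1 \ldots x_t) = -\log\Bigl(\frac{\Gamma(|A|/2)}{\Gamma(1/2)^{|A|}}\, \frac{\prod_{a \in A} \Gamma(\nu^t(a) + 1/2)}{\Gamma(t + |A|/2)}\Bigr)$$

$$= c_1 + c_2 |A| + \log \Gamma(t + |A|/2) - \sum_{a \in A} \log \Gamma(\nu^t(a) + 1/2),$$

where $c_1$, $c_2$ are constants. Now we use the well-known Stirling formula

$$\ln \Gamma(s) = \ln \sqrt{2\pi} + (s - 1/2) \ln s - s + \theta/12,$$

where $\theta \in (0, 1)$ [22]. Using this formula we rewrite the previous equality as

$$-\log K_0(x_1 \ldots x_t) = -\sum_{a \in A} \nu^t(a) \log(\nu^t(a)/t) + (|A| - 1) \log t / 2 + \bar c_1 + \bar c_2 |A|,$$

where $\bar c_1$, $\bar c_2$ are constants. Hence,

$$\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(-\log K_0(x_1 \ldots x_t)) \le t \Bigl(\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)\Bigl(-\sum_{a \in A} (\nu^t(a)/t) \log(\nu^t(a)/t)\Bigr)\Bigr) + (|A| - 1) \log t / 2 + c|A|.$$

Applying the well-known Jensen inequality for the concave function $-x \log x$ we obtain the following inequality:

$$\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(-\log K_0(x_1 \ldots x_t)) \le -t \sum_{a \in A} \Bigl(\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(\nu^t(a)/t)\Bigr) \log \Bigl(\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(\nu^t(a)/t)\Bigr) + (|A| - 1) \log t / 2 + c|A|.$$

The source P is i.i.d., which is why the average frequency

$$\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)\, \nu^t(a)/t$$

is equal to P(a) for any $a \in A$, and we obtain from the two last formulas the following inequality:

$$\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(-\log K_0(x_1 \ldots x_t)) \le t\Bigl(-\sum_{a \in A} P(a) \log P(a)\Bigr) + (|A| - 1) \log t / 2 + c|A|. \qquad (49)$$

On the other hand,

$$\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log P(x_1 \ldots x_t) = \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \sum_{i=1}^{t} \log P(x_i) = t\Bigl(\sum_{a \in A} P(a) \log P(a)\Bigr). \qquad (50)$$

From (49) and (50) we can see that

$$t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log \frac{P(x_1 \ldots x_t)}{K_0(x_1 \ldots x_t)} \le ((|A| - 1) \log t / 2 + c)/t.$$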

□

Proof of Claim 3. First we consider the case where m = 0. The proof for this case is very close to the proof of the previous claim. Namely, from (16) we obtain:

$$-\log K_0(x_1 \ldots x_t) = -\log\Bigl(\frac{\Gamma(|A|/2)}{\Gamma(1/2)^{|A|}}\, \frac{\prod_{a \in A} \Gamma(\nu^t(a) + 1/2)}{\Gamma(t + |A|/2)}\Bigr)$$

$$= c_1 + c_2 |A| + \log \Gamma(t + |A|/2) - \sum_{a \in A} \log \Gamma(\nu^t(a) + 1/2),$$

where $c_1$, $c_2$ are constants. Now we use the well-known Stirling formula

$$\ln \Gamma(s) = \ln \sqrt{2\pi} + (s - 1/2) \ln s - s + \theta/12,$$

where $\theta \in (0, 1)$ [22]. Using this formula we rewrite the previous equality as

$$-\log K_0(x_1 \ldots x_t) = -\sum_{a \in A} \nu^t(a) \log(\nu^t(a)/t) + (|A| - 1) \log t / 2 + \bar c_1 + \bar c_2 |A|,$$

where $\bar c_1$, $\bar c_2$ are constants. Having taken into account the definition of the empirical entropy (23), we obtain

$$-\log K_0(x_1 \ldots x_t) \le t\, h^*_0(x_1 \ldots x_t) + (|A| - 1) \log t / 2 + c|A|.$$

Hence,

$$\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(-\log K_0(x_1 \ldots x_t)) \le t \Bigl(\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)\, h^*_0(x_1 \ldots x_t)\Bigr) + (|A| - 1) \log t / 2 + c|A|.$$

Having taken into account the definition of the empirical entropy (23), we apply the well-known Jensen inequality for the concave function $-x \log x$ and obtain the following inequality:

$$\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(-\log K_0(x_1 \ldots x_t)) \le -t \sum_{a \in A} \Bigl(\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(\nu^t(a)/t)\Bigr) \log \Bigl(\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(\nu^t(a)/t)\Bigr) + (|A| - 1) \log t / 2 + c|A|.$$

P is stationary and ergodic, which is why the average frequency

$$\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)\, \nu^t(a)/t$$

is equal to P(a) for any $a \in A$, and we obtain from the two last formulas the following inequality:

$$\sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t)(-\log K_0(x_1 \ldots x_t)) \le t\, h_0(P) + (|A| - 1) \log t / 2 + c|A|,$$

where $h_0(P)$ is the first order Shannon entropy, see (12).

We have seen that any source from $M_m(A)$ can be presented as a "sum" of $|A|^m$ i.i.d. sources. From this we can easily see that the error of a predictor for a source from $M_m(A)$ can be upper bounded by the error for an i.i.d. source multiplied by $|A|^m$. In particular, we obtain from the last inequality and the definition of the Shannon entropy (20) the upper bound stated in the claim. □



Proof of Theorem 1. We can see from the definition (25) of R and Claim 3 that the average error is upper bounded as follows:

$$-t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log(R(x_1 \ldots x_t)) - h_k(P) \le (|A|^k (|A| - 1) \log t + \log(1/\omega_{k+1}) + C)/(2t),$$

for any k = 0, 1, 2, .... Taking into account that for any $P \in M_\infty(A)$ $\lim_{k\to\infty} h_k(P) = h_\infty(P)$, we can see that

$$\lim_{t\to\infty} \Bigl(-t^{-1} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log(R(x_1 \ldots x_t)) - h_\infty(P)\Bigr) = 0.$$

The second statement of the theorem is proven. The first one can be easily derived from the ergodicity of P [5, 13]. □

Proof of Theorem 2. The proof is based on the Shannon-McMillan-Breiman theorem which states that for any stationary and ergodic source P

$$\lim_{t\to\infty} -\log P(x_1 \ldots x_t)/t = h_\infty(P)$$

with probability 1. From this equality and (29) we obtain the statement i). The second statement follows from the definition of the Shannon entropy (21) and (30). □

Proof of Theorem 4. i) immediately follows from the second statement of Theorem 2 and properties of log. The statement ii) can be proven as follows:

$$\lim_{t\to\infty} E\Bigl(\frac{1}{t} \sum_{i=0}^{t-1} \bigl(P(x_{i+1}|x_1 \ldots x_i) - \mu_U(x_{i+1}|x_1 \ldots x_i)\bigr)^2\Bigr)$$

$$\le \lim_{t\to\infty} \frac{1}{t} \sum_{i=0}^{t-1} \sum_{x_1 \ldots x_i \in A^i} P(x_1 \ldots x_i) \Bigl(\sum_{a \in A} \bigl|P(a|x_1 \ldots x_i) - \mu_U(a|x_1 \ldots x_i)\bigr|\Bigr)^2$$

$$\le \lim_{t\to\infty} \frac{const}{t} \sum_{i=0}^{t-1} \sum_{x_1 \ldots x_i \in A^i} P(x_1 \ldots x_i) \sum_{a \in A} P(a|x_1 \ldots x_i) \log \frac{P(a|x_1 \ldots x_i)}{\mu_U(a|x_1 \ldots x_i)}$$

$$= \lim_{t\to\infty} \Bigl(\frac{const}{t} \sum_{x_1 \ldots x_t \in A^t} P(x_1 \ldots x_t) \log(P(x_1 \ldots x_t)/\mu_U(x_1 \ldots x_t))\Bigr).$$

Here the first inequality is obvious, the second follows from Pinsker's inequality (5), the others from properties of expectation and log. By the second statement of Theorem 2 the last limit is 0. iii) can be derived from ii) and the Jensen inequality for the function $x^2$. □

Proof of Theorem 5. The following inequality follows from the nonnegativity of the KL divergence, whereas the equality is obvious.

$$E\Bigl(\log \frac{P(x_1|y_1)}{\mu_U(x_1|y_1)}\Bigr) + E\Bigl(\log \frac{P(x_2|(x_1,y_1),y_2)}{\mu_U(x_2|(x_1,y_1),y_2)}\Bigr) + \ldots$$

$$\le E\Bigl(\log \frac{P(y_1)}{\mu_U(y_1)}\Bigr) + E\Bigl(\log \frac{P(x_1|y_1)}{\mu_U(x_1|y_1)}\Bigr) + E\Bigl(\log \frac{P(y_2|(x_1,y_1))}{\mu_U(y_2|(x_1,y_1))}\Bigr) + E\Bigl(\log \frac{P(x_2|(x_1,y_1),y_2)}{\mu_U(x_2|(x_1,y_1),y_2)}\Bigr) + \ldots$$

$$= E\Bigl(\log \frac{P(x_1,y_1)}{\mu_U(x_1,y_1)}\Bigr) + E\Bigl(\log \frac{P((x_2,y_2)|(x_1,y_1))}{\mu_U((x_2,y_2)|(x_1,y_1))}\Bigr) + \ldots$$

Now we can apply the first statement of the previous theorem to the last sum as follows:

$$\lim_{t\to\infty} \frac{1}{t} E\Bigl(\log \frac{P(x_1,y_1)}{\mu_U(x_1,y_1)} + \log \frac{P((x_2,y_2)|(x_1,y_1))}{\mu_U((x_2,y_2)|(x_1,y_1))} + \ldots + \log \frac{P((x_t,y_t)|(x_1,y_1) \ldots (x_{t-1},y_{t-1}))}{\mu_U((x_t,y_t)|(x_1,y_1) \ldots (x_{t-1},y_{t-1}))}\Bigr) = 0.$$

From this equality and the last inequality we obtain the proof of i). The proof of the second statement can be obtained from the similar representation for ii) and the second statement of Theorem 4. iii) can be derived
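from ii) and the Jensen inequality for the function $x^2$. □

Proof of Lemma 1. First we show that for any source $\theta^* \in M_0(A)$ and any words $x^1 = x^1_1 \ldots x^1_{t_1}$, ..., $x^r = x^r_1 \ldots x^r_{t_r}$,

$$\theta^*(x^1 \diamond \ldots \diamond x^r) = \prod_{a \in A} (\theta^*(a))^{\nu_{x^1 \diamond \ldots \diamond x^r}(a)} \le \prod_{a \in A} (\nu_{x^1 \diamond \ldots \diamond x^r}(a)/t)^{\nu_{x^1 \diamond \ldots \diamond x^r}(a)}, \qquad (51)$$

where $t = \sum_{i=1}^{r} t_i$. Here the equality holds, because $\theta^* \in M_0(A)$. The inequality follows from Claim 1. Indeed, if $p(a) = \nu_{x^1 \diamond \ldots \diamond x^r}(a)/t$ and $q(a) = \theta^*(a)$, then

$$\sum_{a \in A} \frac{\nu_{x^1 \diamond \ldots \diamond x^r}(a)}{t} \log \frac{\nu_{x^1 \diamond \ldots \diamond x^r}(a)/t}{\theta^*(a)} \ge 0.$$

From the latter inequality we obtain (51). Taking into account the definition (34) and (51), we can see that the statement of the Lemma is true for this particular case.

For any $\theta \in M_m(A)$ and $x = x_1 \ldots x_s$, s > m, we present $\theta(x_1 \ldots x_s)$ as

$$\theta(x_1 \ldots x_s) = \theta(x_1 \ldots x_m) \prod_{u \in A^m} \prod_{a \in A} \theta(a/u)^{\nu_x(ua)},$$

where $\theta(x_1 \ldots x_m)$ is the limiting probability of the word $x_1 \ldots x_m$. Hence, $\theta(x_1 \ldots x_s) \le \prod_{u \in A^m} \prod_{a \in A} \theta(a/u)^{\nu_x(ua)}$. Taking into account the inequality (51), we obtain $\prod_{a \in A} \theta(a/u)^{\nu_x(ua)} \le \prod_{a \in A} (\nu_x(ua)/\bar\nu_x(u))^{\nu_x(ua)}$ for any word u. Hence,

$$\theta(x_1 \ldots x_s) \le \prod_{u \in A^m} \prod_{a \in A} \theta(a/u)^{\nu_x(ua)} \le \prod_{u \in A^m} \prod_{a \in A} (\nu_x(ua)/\bar\nu_x(u))^{\nu_x(ua)}.$$

If we apply those inequalities to $\theta(x^1 \diamond \ldots \diamond x^r)$, we immediately obtain the following inequalities

$$\theta(x^1 \diamond \ldots \diamond x^r) \le \prod_{u \in A^m} \prod_{a \in A} \theta(a/u)^{\nu_{x^1 \diamond \ldots \diamond x^r}(ua)} \le \prod_{u \in A^m} \prod_{a \in A} (\nu_{x^1 \diamond \ldots \diamond x^r}(ua)/\bar\nu_{x^1 \diamond \ldots \diamond x^r}(u))^{\nu_{x^1 \diamond \ldots \diamond x^r}(ua)}.$$
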
Now the statement of the Lemma follows from the definition (j34p . □ 
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Theoreml^ Let Cq be a critical set of the test T^'^{A,a), i.e., by defini- 
tion, Ca = {u : u £ &i — log7r(n) — > —logo}. Let fi^ be 
a measure for which the claim 2 is true. We define an auxiliary set Ca 
= {u : -log7r(ii) - (- log/i<^(ii)) > -logo}. We have 1 > ^^^6^ 
— S«gC ^l^)/*^ = i^/ct)'^{Ca)- (Here the second inequality follows from 
the definition of Ca, whereas all others are obvious.) So, we obtain that 
T^iCa) ^ Oi. From definitions of Ca, Ca and (j26p we immediately obtain that 
Ca ^ Ca- Thus, 7r(Ca) < a. By definition, Tr{Ca) is the value of the Type I 
error. The first statement of the theorem is proven. 

Let us prove the second statement of the theorem. Suppose that the 
hypothesis H\'^{A) is true. That is, the sequence xi . . .xt is generated by 
some stationary and ergodic source r and t ^ tt. Our strategy is to show 
that 

lim — log7r(xi . . . xt) — \(p{xi . . . xt)\ = oo (52) 

i— >oo 

with probability 1 (according to the measure r). First we represent (j52p as 
-log7r(xi ...Xt) - \(p{xi . ..xt)\ 

= ti- log — — — — - + -(- logr(xi ...Xt) - \ip{xi . . . xt)\)). 
t vr(xi . . . Xt) t 



From this equality and the property of a universal code (j29p we obtain 

- log7r(xi ...Xt)- |(^(xi ...xt)\=t{^ log ^['^^•••'^^i + o(l)). (53) 

t TT[Xl...Xt) 



From ()29p and (j2ip we can see that 

lim -logT{xi .. .xt)/t < hkir) (54) 

t—too 

for any k > (with probability 1). It is supposed that the process vr has a 
finite memory, i.e. belongs to Ms{A) for some s. Having taken into account 
the definition of Ms{A) (jlSp . we obtain the following representation: 

t 

-log7r(xi . ..xt)/t = -t"^ ^log7r(xi/xi . ..Xi-i) 

4 = 1 

k t 
= -t~^C^logTT{Xi/xi...Xi-l)+ ^ log7r(Xi/Xi_fc . . .Xi_l)) 
1=1 i=k+l 
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for any k > s. According to the ergodic theorem there exists a limit 

t 

lim log'K{xi/xi^k ■ ■ -Xi^i), 

i=fc+l 

which is equal to /ifc(r) [5l E] . So, from the two last equalities we can see 
that 

lim (— log7r(xi . . . Xt))/t = — > t{v) > ria/v) log -nia/v). 

Taking into account this equality, (54) and (53), we can see that

\[ -\log \pi(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)| \ge t \Bigl( \sum_{v \in A^k} \tau(v) \sum_{a \in A} \tau(a/v) \log(\tau(a/v)/\pi(a/v)) \Bigr) + o(t) \]

for any $k \ge s$. From this inequality and Claim 1 we can obtain that

\[ -\log \pi(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)| \ge c\,t + o(t), \]

where $c$ is a positive constant and $t \to \infty$. Hence, (52) is true and the theorem is proven. □
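
The constant $c$ in the last bound is the quantity $\sum_{v} \tau(v) \sum_{a} \tau(a/v) \log(\tau(a/v)/\pi(a/v))$. A minimal numerical sketch, assuming two-state first-order Markov sources ($k = 1$) given by hypothetical transition tables and base-2 logarithms, shows that it vanishes for $\tau = \pi$ and is positive otherwise:

```python
import math


def stationary_two_state(trans):
    """Stationary distribution of a 2-state chain, where trans[v][a] = P(a | v)."""
    p01, p10 = trans[0][1], trans[1][0]
    return [p10 / (p01 + p10), p01 / (p01 + p10)]


def divergence_rate(tau, pi):
    """sum_v tau(v) sum_a tau(a|v) log2(tau(a|v) / pi(a|v)) for first-order chains."""
    weights = stationary_two_state(tau)
    return sum(
        weights[v] * tau[v][a] * math.log2(tau[v][a] / pi[v][a])
        for v in range(2)
        for a in range(2)
        if tau[v][a] > 0
    )
```

For example, `divergence_rate([[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]])` is strictly positive, so the statistic grows linearly in $t$ for this pair of sources.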

Theorem^. Let us denote the critical set of the test $T_\varphi^{SI}(A, \alpha)$ as $C_\alpha$, i.e., by definition, $C_\alpha = \{x_1 \ldots x_t : (t - m)\, h_m^*(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)| > \log(1/\alpha)\}$. From Claim 2 we can see that there exists such a measure $\mu_\varphi$ that $-\log \mu_\varphi(x_1 \ldots x_t) \le |\varphi(x_1 \ldots x_t)|$. We also define

\[ \hat C_\alpha = \{x_1 \ldots x_t : (t - m)\, h_m^*(x_1 \ldots x_t) - (-\log \mu_\varphi(x_1 \ldots x_t)) > \log(1/\alpha)\}. \tag{55} \]

Obviously, $\hat C_\alpha \supset C_\alpha$. Let $\theta$ be any source from $M_m(A)$. The following chain of equalities and inequalities is true:

\[
\begin{aligned}
1 \ge \mu_\varphi(\hat C_\alpha) &= \sum_{x_1 \ldots x_t \in \hat C_\alpha} \mu_\varphi(x_1 \ldots x_t) \\
&\ge \alpha^{-1} \sum_{x_1 \ldots x_t \in \hat C_\alpha} 2^{-(t-m)\, h_m^*(x_1 \ldots x_t)} \ge \alpha^{-1} \sum_{x_1 \ldots x_t \in \hat C_\alpha} \theta(x_1 \ldots x_t) = \alpha^{-1}\, \theta(\hat C_\alpha).
\end{aligned}
\]

(Here both equalities and the first inequality are obvious, the second and the third inequalities follow from (55) and the Lemma, correspondingly.) So, we obtain that $\theta(\hat C_\alpha) \le \alpha$ for any source $\theta \in M_m(A)$. Taking into account that $\hat C_\alpha \supset C_\alpha$, where $C_\alpha$ is the critical set of the test, we can see that the probability of the Type I error is not greater than $\alpha$. The first statement of the theorem is proven.

The proof of the second statement will be based on some results of Information Theory. We obtain from (29) that for any stationary and ergodic $p$

\[ \lim_{t \to \infty} t^{-1} |\varphi(x_1 \ldots x_t)| = h_\infty(p) \tag{56} \]

with probability 1. It can be seen from (23) that $h_m^*$ is an estimate for the $m$-order Shannon entropy (20). Applying the ergodic theorem we obtain $\lim_{t \to \infty} h_m^*(x_1 \ldots x_t) = h_m(p)$ with probability 1 O IH] . It is known in Information Theory that $h_m(\varrho) - h_\infty(\varrho) > 0$, if $\varrho$ belongs to $M_\infty(A) \setminus M_m(A)$ O HU . It is supposed that $H_1^{SI}$ is true, i.e. the considered process belongs to $M_\infty(A) \setminus M_m(A)$. So, from (56) and the last inequality we obtain that $\lim_{t \to \infty} \bigl( (t - m)\, h_m^*(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)| \bigr) = \infty$. This proves the second statement of the theorem. □
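
The statistic $(t - m)\, h_m^*(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)|$ can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the paper's implementation: $h_m^*$ is taken to be the usual empirical conditional entropy of order $m$ (the definition (23) lies outside this section), `zlib` stands in for the code $\varphi$, logarithms are base 2, and the function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def empirical_cond_entropy_bits(seq: bytes, m: int) -> float:
    """(t - m) * h*_m(x_1...x_t) in bits, computed from context and symbol counts."""
    ctx_counts = Counter()
    pair_counts = Counter()
    for i in range(m, len(seq)):
        ctx = seq[i - m:i]          # the m symbols preceding position i
        ctx_counts[ctx] += 1
        pair_counts[(ctx, seq[i])] += 1
    return -sum(n * math.log2(n / ctx_counts[ctx]) for (ctx, _), n in pair_counts.items())


def reject_order_m(seq: bytes, m: int, alpha: float) -> bool:
    """Reject 'memory at most m' when (t - m) h*_m(x) - |phi(x)| > log2(1 / alpha)."""
    statistic = empirical_cond_entropy_bits(seq, m) - 8 * len(zlib.compress(seq, 9))
    return statistic > math.log2(1 / alpha)
```

For $m = 0$ the alternating sequence 0101... has empirical entropy of one bit per symbol but compresses to a few hundred bits, so independence is rejected; for $m = 1$ the same sequence has zero empirical conditional entropy and is not rejected.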

Theorem\^ First we prove that with probability 1 there exists the following limit $\lim_{t \to \infty} \frac{1}{t} \log(p(x_1 \ldots x_t)/r_U(x_1 \ldots x_t))$ and this limit is finite and nonnegative. Let $A_n = \{x_1, \ldots, x_n : p(x_1, \ldots, x_n) \ne 0\}$. Define

\[ z_n(x_1 \ldots x_n) = r_U(x_1 \ldots x_n)/p(x_1 \ldots x_n) \tag{57} \]

for $(x_1, \ldots, x_n) \in A_n$ and $z_n = 0$ elsewhere.
Since

\[
\begin{aligned}
E_P(z_n | x_1, \ldots, x_{n-1}) &= E_P\Bigl( \frac{r_U(x_1 \ldots x_n)}{p(x_1 \ldots x_n)} \Bigm| x_1, \ldots, x_{n-1} \Bigr) \\
&= \frac{r_U(x_1 \ldots x_{n-1})}{p(x_1 \ldots x_{n-1})}\, E_P\Bigl( \frac{r_U(x_n | x_1 \ldots x_{n-1})}{p(x_n | x_1 \ldots x_{n-1})} \Bigr) \\
&= z_{n-1} \int_A \frac{r_U(x_n | x_1 \ldots x_{n-1})\, dP(x_n | x_1 \ldots x_{n-1})}{dP(x_n | x_1 \ldots x_{n-1})/dM_n(x_n | x_1 \ldots x_{n-1})} \\
&= z_{n-1} \int_A r_U(x_n | x_1 \ldots x_{n-1})\, dM_n(x_n | x_1 \ldots x_{n-1}) \le z_{n-1},
\end{aligned}
\]

the stochastic sequence $(z_n, B^n)$ is, by definition, a non-negative supermartingale with respect to $P$, with $E(z_n) \le 1$ [49]. Hence, Doob's submartingale convergence theorem implies that the limit of $z_n$ exists and is finite with $P$-probability 1 (see [49, Theorem 7.4.1]). Since all terms are nonnegative, so is the limit. Using the definition (57), with $P$-probability 1 we have

\[ \lim_{n \to \infty} p(x_1 \ldots x_n)/r_U(x_1 \ldots x_n) > 0, \]

\[ \lim_{n \to \infty} \log(p(x_1 \ldots x_n)/r_U(x_1 \ldots x_n)) > -\infty \]

and

\[ \lim_{n \to \infty} \frac{1}{n} \log(p(x_1 \ldots x_n)/r_U(x_1 \ldots x_n)) \ge 0. \tag{58} \]

Now we note that for any integer $s$ the following obvious equality is true: $r_U(x_1 \ldots x_t) = \omega_s\, \mu_U(x_1^{[s]} \ldots x_t^{[s]}) / M_t(x_1^{[s]} \ldots x_t^{[s]}) + \delta$ for some $\delta > 0$. From
this equality, ([3T]) and ([ISj) we immediately obtain that a.s. 



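\[
\begin{aligned}
\lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} &\le \lim_{t \to \infty} \frac{-\log \omega_s}{t} + \lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{\mu_U(x_1^{[s]} \ldots x_t^{[s]}) / M_t(x_1^{[s]} \ldots x_t^{[s]})} \\
&\le \lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|} / M_t(x_1^{[s]} \ldots x_t^{[s]})}.
\end{aligned}
\tag{59}
\]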



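The right part can be presented as follows:

\[
\begin{aligned}
\lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|} / M_t(x_1^{[s]} \ldots x_t^{[s]})}
&= \lim_{t \to \infty} \frac{1}{t} \log \frac{p^s(x_1 \ldots x_t)\, M_t(x_1^{[s]} \ldots x_t^{[s]})}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|}} \\
&\quad + \lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{p^s(x_1 \ldots x_t)}.
\end{aligned}
\]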

Having taken into account that $U$ is a universal code, (40) and Theorem 2, we can see that the first term is equal to zero. From (39) and (42) we can see that a.s. the second term is equal to $h_s - h$. This equality is valid for any integer $s$ and, according to (44), the second term equals zero, too, and we obtain that

\[ \lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} \le 0. \]



Having taken into account (58), we can see that the first statement is proven. From ([7]) and ([7j) we can see that

\[
\frac{1}{t} E \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} \le \frac{1}{t} E \log \frac{p^s(x_1 \ldots x_t)\, M_t(x_1^{[s]} \ldots x_t^{[s]})}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|}} + \frac{1}{t} E \log \frac{p(x_1 \ldots x_t)}{p^s(x_1, \ldots, x_t)}. \tag{61}
\]

The first term is the average redundancy of the universal code for a finite-alphabet source, hence, according to Theorem 2, it tends to 0. The second term tends to $h_s - h$ for any $s$ and from (44) we can see that it is equal to zero. The second statement is proven. □



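Theorem\^ Obviously,

\[ \frac{1}{t} \sum_{m=0}^{t-1} E \bigl( P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) \bigr)^2 \le \tag{62} \]

\[ \frac{1}{t} \sum_{m=0}^{t-1} E \bigl( |P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m)| + |P(x_{m+1} \in \bar B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in \bar B_{m+1} | x_1 \ldots x_m)| \bigr)^2. \]

From the Pinsker inequality (5) and convexity of the KL divergence (6) we obtain the following inequalities

\[ \frac{1}{t} \sum_{m=0}^{t-1} E \bigl( |P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m)| + |P(x_{m+1} \in \bar B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in \bar B_{m+1} | x_1 \ldots x_m)| \bigr)^2 \le \tag{63} \]

\[ \frac{const}{t} \sum_{m=0}^{t-1} E \Bigl( P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) \log \frac{P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m)}{R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m)} + P(x_{m+1} \in \bar B_{m+1} | x_1 \ldots x_m) \log \frac{P(x_{m+1} \in \bar B_{m+1} | x_1 \ldots x_m)}{R_U(x_{m+1} \in \bar B_{m+1} | x_1 \ldots x_m)} \Bigr) \le \]

\[ \frac{const}{t} \sum_{m=0}^{t-1} \Bigl( \int p(x_1 \ldots x_m) \Bigl( \int p(x_{m+1} | x_1 \ldots x_m) \log \frac{p(x_{m+1} | x_1 \ldots x_m)}{r_U(x_{m+1} | x_1 \ldots x_m)}\, dM \Bigr) dM_m \Bigr). \]

Having taken into account that the last term is equal to $\frac{const}{t} E \bigl( \log(p(x_1 \ldots x_t)/r_U(x_1 \ldots x_t)) \bigr)$, from (62), (63) and (46) we obtain (47). ii) can be derived from i) and the Jensen inequality for the function $x^2$. □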
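
The step from (63) to the divergence bound rests on the Pinsker inequality applied to the two-point distributions $(P(B), P(\bar B))$ and $(R_U(B), R_U(\bar B))$. A small numerical check, assuming base-2 logarithms so that the constant is $2 \ln 2$ (the paper leaves it as "const") and using hypothetical function names:

```python
import math


def l1_distance(p: float, q: float) -> float:
    """|P(B) - R(B)| + |P(not B) - R(not B)| for two-point distributions."""
    return abs(p - q) + abs((1 - p) - (1 - q))


def kl_bits(p: float, q: float) -> float:
    """KL divergence in bits between the two-point distributions (p, 1-p) and (q, 1-q)."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            total += a * math.log2(a / b)
    return total


def pinsker_gap(p: float, q: float) -> float:
    """2 ln 2 * KL_bits - L1^2, which the Pinsker inequality says is nonnegative."""
    return 2 * math.log(2) * kl_bits(p, q) - l1_distance(p, q) ** 2
```

The gap is zero when the two distributions coincide and stays nonnegative on any grid of probabilities strictly between 0 and 1.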

Theorem 10. The last inequality of the following chain follows from the Pinsker inequality, whereas all others are obvious.

\[
\begin{aligned}
\Bigl( \int f(x)\, p(x | x_1 \ldots x_m)\, dM_m - \int f(x)\, r_U(x | x_1 \ldots x_m)\, dM_m \Bigr)^2
&= \Bigl( \int f(x) \bigl( p(x | x_1 \ldots x_m) - r_U(x | x_1 \ldots x_m) \bigr)\, dM_m \Bigr)^2 \\
&\le M^2 \Bigl( \int \bigl| p(x | x_1 \ldots x_m) - r_U(x | x_1 \ldots x_m) \bigr|\, dM_m \Bigr)^2 \\
&\le const \int p(x | x_1 \ldots x_m) \log \frac{p(x | x_1 \ldots x_m)}{r_U(x | x_1 \ldots x_m)}\, dM_m.
\end{aligned}
\]
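From these inequalities we obtain:

\[ \sum_{m=0}^{t-1} E \Bigl( \int f(x)\, p(x | x_1 \ldots x_m)\, dM_m - \int f(x)\, r_U(x | x_1 \ldots x_m)\, dM_m \Bigr)^2 \le \tag{64} \]

\[ \sum_{m=0}^{t-1} const\; E \Bigl( \int p(x | x_1 \ldots x_m) \log \frac{p(x | x_1 \ldots x_m)}{r_U(x | x_1 \ldots x_m)}\, dM_m \Bigr). \]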

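The last term can be presented as follows:

\[
\begin{aligned}
\sum_{m=0}^{t-1} E \Bigl( \int p(x | x_1 \ldots x_m) \log \frac{p(x | x_1 \ldots x_m)}{r_U(x | x_1 \ldots x_m)}\, dM_m \Bigr)
&= \sum_{m=0}^{t-1} \int p(x_1 \ldots x_m) \Bigl( \int p(x | x_1 \ldots x_m) \log \frac{p(x | x_1 \ldots x_m)}{r_U(x | x_1 \ldots x_m)}\, dM_m \Bigr) dM_m \\
&= \int p(x_1 \ldots x_t) \log(p(x_1 \ldots x_t)/r_U(x_1 \ldots x_t))\, dM_t.
\end{aligned}
\]

From this equality, (64) and Corollary 1 we obtain (48). ii) can be derived from i) and the Jensen inequality for the function $x^2$. □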
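
The identity used above, that the sum over $m$ of expected conditional divergences equals the divergence between the joint distributions, can be checked numerically in a finite-alphabet analogue. In the sketch below $p$ is an assumed i.i.d. binary source with parameter 0.7 and $r$ is an add-1/2 (Krichevsky-type) predictor standing in for $r_U$; integrals become sums over binary strings, and all names are hypothetical.

```python
import itertools
import math


def p_cond(a, prefix, theta=0.7):
    """Conditional probability of symbol a under an assumed i.i.d. source."""
    return theta if a == 1 else 1 - theta


def r_cond(a, prefix):
    """Add-1/2 predictor (n_a + 1/2) / (m + 1) over a binary alphabet."""
    return (prefix.count(a) + 0.5) / (len(prefix) + 1)


def joint(cond, seq):
    """Joint probability of seq as a product of conditional probabilities."""
    prob = 1.0
    for i, a in enumerate(seq):
        prob *= cond(a, seq[:i])
    return prob


def kl_joint(t):
    """Sum over x_1...x_t of p(x) log2(p(x) / r(x))."""
    return sum(
        joint(p_cond, s) * math.log2(joint(p_cond, s) / joint(r_cond, s))
        for s in itertools.product((0, 1), repeat=t)
    )


def kl_sum_of_conditionals(t):
    """Sum over m < t of E_p[ KL(p(.|x_1...x_m) || r(.|x_1...x_m)) ]."""
    total = 0.0
    for m in range(t):
        for prefix in itertools.product((0, 1), repeat=m):
            inner = sum(
                p_cond(a, prefix) * math.log2(p_cond(a, prefix) / r_cond(a, prefix))
                for a in (0, 1)
            )
            total += joint(p_cond, prefix) * inner
    return total
```

For any small `t` the two functions agree up to floating-point error, which is the discrete counterpart of the last equality.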

Theorem The following chain proves the first statement of the theorem:

\[
\begin{aligned}
P\{H_0(\Lambda) \text{ is rejected} / H_0 \text{ is true}\} &= P\Bigl\{ \bigcup_{i=1}^{\infty} \{H_0(\Lambda_i) \text{ is rejected} / H_0 \text{ is true}\} \Bigr\} \\
&\le \sum_{i=1}^{\infty} P\{H_0(\Lambda_i) \text{ is rejected} / H_0 \text{ is true}\} \le \sum_{i=1}^{\infty} (\alpha\, \omega_i) = \alpha.
\end{aligned}
\]

(Here both inequalities follow from the description of the test, whereas the
last equality follows from ([Ml)-) 

The second statement also follows from the description of the test. Indeed, let a sample be created by a source $\varrho$, for which $H_1(\Lambda)$ is true. It is supposed that the sequence of partitions $\Lambda$ discriminates between $H_0(\Lambda)$ and $H_1(\Lambda)$. By definition, it means that there exists $j$ for which $H_1(\Lambda_j)$ is true for the process $\varrho$. It immediately follows from Theorem [T] - H] that the Type II error of the test $T_\varphi(\Lambda_j, \alpha\, \omega_j)$ goes to 0, when the sample size tends to infinity. □
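
The first statement relies only on the levels $\alpha\, \omega_i$ of the individual tests summing to at most $\alpha$. A minimal sketch of such an allocation, assuming the weights $\omega_i = 1/(i(i+1))$ (chosen here only because they sum to 1; the paper's own choice of $\omega_i$ is defined elsewhere) and hypothetical function names:

```python
from fractions import Fraction


def omega(i: int) -> Fraction:
    """Assumed weight of the i-th partition; the series 1/(i(i+1)) sums to 1."""
    return Fraction(1, i * (i + 1))


def level(alpha: Fraction, i: int) -> Fraction:
    """Significance level alpha * omega_i used for the test on the i-th partition."""
    return alpha * omega(i)


def total_level(alpha: Fraction, n: int) -> Fraction:
    """Union bound on the Type I error after the first n partitions."""
    return sum(level(alpha, i) for i in range(1, n + 1))
```

Because the partial sums equal $\alpha\, n/(n+1)$, the union bound never exceeds $\alpha$, however many partitions are examined.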